arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20194 2026-05-21 cs.CL cs.AI cs.LG

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

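AI summary This paper proposes a structured framework that combines parallel chunk-level processing with evidence-anchored consolidation to reduce bias, omission errors, and over-generalization in long-document analysis, improving the reliability and scalability of text analysis.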

Comments Accepted for publication at the 12th Intelligent Systems Conference 2026, 3-4 September 2026, Amsterdam, The Netherlands

Abstract

Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.
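
The pipeline described in this abstract (independent chunk analysis followed by evidence-anchored merging) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `split_into_chunks`, `analyze_chunk`, `consolidate`, and `analyze_document` are hypothetical names, paragraph splitting stands in for semantic chunking, and a toy keyword rule stands in for the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text: str) -> list[str]:
    # Stand-in for semantic chunking (assumption): one chunk per non-empty paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def analyze_chunk(indexed_chunk: tuple[int, str]) -> list[dict]:
    # Hypothetical per-chunk analysis. A real system would call an LLM here;
    # this stub treats every word longer than 6 characters as a "concept" and
    # records the sentence it came from as evidence.
    chunk_id, chunk = indexed_chunk
    findings = []
    for sentence in chunk.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        for word in sentence.split():
            concept = word.strip(",;:").lower()
            if len(concept) > 6:
                findings.append(
                    {"concept": concept, "chunk": chunk_id, "evidence": sentence}
                )
    return findings


def consolidate(per_chunk_findings: list[list[dict]]) -> dict:
    # Merge findings by concept, keeping (chunk id, evidence sentence) anchors
    # so every consolidated concept stays traceable to its source text.
    merged: dict[str, list[tuple[int, str]]] = {}
    for findings in per_chunk_findings:
        for f in findings:
            merged.setdefault(f["concept"], []).append((f["chunk"], f["evidence"]))
    return merged


def analyze_document(text: str) -> dict:
    chunks = split_into_chunks(text)
    # Each chunk is analyzed without seeing the results of any other chunk.
    with ThreadPoolExecutor() as pool:
        per_chunk = list(pool.map(analyze_chunk, enumerate(chunks)))
    return consolidate(per_chunk)
```

Because `pool.map` returns results in input order, the evidence anchors come out in document order even though the chunks are processed concurrently.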

2605.20193 2026-05-21 cs.CL cs.AI cs.LG

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

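AI summary This paper studies how different bit-width quantization levels and quantization types affect the performance of LLaMA-3.1 (8B) in qualitative analysis, and proposes a quantization-aware multi-pass prompt verification method to improve stability and accuracy. The 8-bit models stay closest to the gold standard, the 4-bit models become stable once the method is applied, and the 3-bit and 2-bit models improve with the prompt design and verification.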

Comments Accepted for publication at the 12th Intelligent Systems Conference 2026, 3-4 September 2026, Amsterdam, The Netherlands

Abstract

Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different low-bit quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

2605.20191 2026-05-21 cs.CL

Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs

Marco Bombieri, Simone Paolo Ponzetto, Marco Rospocher

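AI summary This paper studies how large language models (LLMs) represent people with disabilities by generating social media posts that simulate their perspectives and comparing them with posts written by real people with disabilities, revealing a tendency of LLMs to idealize disability along with underlying biases.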

Comments Accepted for publication in ACM Transactions on Intelligent Systems and Technology

Journal ref
ACM Trans. Intell. Syst. Technol. (2026)

Abstract

Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

2605.20190 2026-05-21 cs.AI cs.GR

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang

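AI summary This paper proposes the COSMO-Agent framework, which uses reinforcement learning to teach LLMs to complete the closed-loop CAD-CAE process, addressing the CAD-CAE semantic gap and improving the feasibility, efficiency, and stability of industrial design-simulation optimization.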

Comments 8 pages, 3 figures

Abstract

Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

2605.20189 2026-05-21 cs.AI cs.LG

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Nitin Vetcha, Dianbo Liu

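AI summary This paper proposes SOLAR, a self-optimizing open-ended autonomous agent that self-improves through parameter-level meta-learning, targeting concept drift and the high cost of gradient-based adaptation in dynamic real-world settings, and reports better performance than strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks.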

Comments Accepted at "Association for the Advancement of Artificial Intelligence 2026 Conference" in Streaming Continual Learning Bridge. Published in CEUR Workshop Proceedings (Original version at https://ceur-ws.org/Vol-4183/paper2.pdf)

Journal ref
CEUR Workshop Proceedings, Vol. 4183, 2026

Abstract

Despite the remarkable success of large language models (LLMs), they still face bottlenecks when deployed in dynamic, real-world settings, the primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without causing catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR), an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It begins by consolidating a strong prior over common-sense knowledge, making it effective for transfer learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

2605.20188 2026-05-21 cs.LG cs.AI

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

Krati Saxena, Tomohiro Shibata

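AI summary This paper proposes GraphDiffMed, a medication recommendation framework that combines noise-aware attention with pharmacological constraints, applying dual-scale differential attention at the intra-visit and inter-visit levels to filter spurious signals and improve recommendation quality and safety.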

Abstract

Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug-drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge-constrained medication recommendation framework built on dual-scale Differential Attention v2. Differential attention is applied at both intra-visit and inter-visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC-III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest-performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise-aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open-source our code at https://github.com/saxenakrati09/GraphDiffMed.

2605.20187 2026-05-21 cs.LG cs.AI cs.IT math.IT

Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

Jai Sharma, Yifan Wang, Bryan Li

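AI summary This paper proposes a neural framework that estimates pairwise conditional mutual information (MI) directly from the hidden states of a pretrained masked diffusion model (MDM), supervised by ground-truth MI computed from the model's own conditional distributions. The estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding.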

Comments 6 pages, 3 figures; submitting to ICML 2026

Abstract

Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter-variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM-C, where the MI maps recover known structural constraints and enable a 3-5x reduction in inference-time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy-based parallelization methods.
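
To make the two ingredients in this abstract concrete, the sketch below computes mutual information from a small joint distribution table and then groups positions whose pairwise MI is low enough to decode together. It is only an illustration under stated assumptions: `mutual_information` and `independent_groups` are hypothetical names, the greedy grouping rule and the threshold are not taken from the paper, and the paper's estimator is a neural network rather than this exact table computation.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = [sum(row) for row in joint]          # marginal over rows
    py = [sum(col) for col in zip(*joint)]    # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


def independent_groups(mi_matrix, threshold):
    # Greedy grouping (assumption): a position joins the first group in which
    # its MI with every existing member is below the threshold; otherwise it
    # starts a new group. Each group could then be decoded in one pass.
    groups = []
    for i in range(len(mi_matrix)):
        for group in groups:
            if all(mi_matrix[i][j] < threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

With three positions where only positions 0 and 1 depend strongly on each other, the grouping yields two decoding passes instead of three, which is the kind of saving the abstract attributes to MI-guided parallel decoding.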

2605.17630 2026-05-21 cs.CV

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed

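AI summary This paper proposes SegRAG, a training-free retrieval-augmented semantic segmentation framework that combines class-specific point prompts with class-name text to improve segmentation, outperforming the text-only baseline on several benchmarks.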

Abstract

Open-vocabulary segmentation models such as SAM3 perform well across broad categories via text prompting, yet degrade when target classes are visually underrepresented in pretraining or depart from canonical depictions, limitations that text prompts cannot resolve spatially. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with class-specific point prompts derived from a curated DINOv3 feature bank. Offline, dense patch-level descriptors are extracted from annotated references and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape against retrieved prototypes, identifies coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. The resulting point prompts are delivered jointly with class-name text in a single SAM3 forward pass. On four standard benchmarks, SegRAG consistently outperforms the text-only baseline, gaining up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks under zero-shot domain transfer, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablations confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.

2605.17568 2026-05-21 cs.LG

Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling

Zhitong Xu, Qiwei Yuan, Yinghao Chen, Shandian Zhe, Bin Shen

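AI summary This paper proposes a structured neural marked point process (SNMPP) that retains high modeling flexibility while explicitly discovering event-wise and class-wise relationships, and validates on synthetic and real-world datasets its ability to uncover structured relationships and deliver strong predictive performance.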

Abstract

Multi-class event streams arise in numerous real-world applications, where uncovering structured, interpretable inter-event relationships, together with accurate prediction, remains a central challenge. Existing neural point process models are highly expressive but encode event interactions in a black-box manner, preventing explicit discovery of structured dependencies. In this paper, we propose a structured neural marked point process (SNMPP) that achieves high modeling flexibility while enabling explicit event-wise and class-wise relationship discovery from data. Our model constructs a product-form neural influence kernel composed of a signed interaction network over event types and a delay-aware monotonic temporal network. This design enables explicit characterization of inter-class influence topology -- including excitation, inhibition, and neutrality -- while flexibly capturing diverse temporal decay patterns and potential influence delays. For efficient learning, we develop a stratified Monte Carlo estimator for stochastic training. Extensive experiments on synthetic and real-world benchmark datasets validate the ability of our approach to uncover structured relationships and deliver strong predictive performance.

2605.17140 2026-05-21 cs.CV cs.AI cs.CL

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

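AI summary This paper introduces UCSF-PDGM-VQA, a clinically relevant visual question answering benchmark with 2,387 question-answer pairs for evaluating how vision-language models handle multi-sequence 3D MRI scans, and finds that current models fall short in multimodal processing.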

Comments 10 pages, 2 figures, 6 tables

Abstract

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

2605.15419 2026-05-21 cs.LG

Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design

Shukai Du, Junzhe Zhang, Yiming Li

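AI summary This paper proposes a path-design framework based on the least-action principle, in which the probability path and velocity field are determined by minimizing the action of a general Lagrangian, and relates it to optimal transport and conditional flow matching.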

Abstract

Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe that this corresponds to the simplest case of the least-action principle in classical mechanics, in which the kinetic Lagrangian yields free-particle straight-line trajectories. Building on this observation, we propose Lagrangian flow matching, a physics-based framework in which the probability path and velocity field are determined by minimizing the action of a general Lagrangian subject to the continuity equation and the prescribed endpoints. We show that this dynamic problem admits an equivalent static optimal transport (OT) formulation, yielding a family of simulation-free training objectives that recover OT-based flow matching as the kinetic special case and the trigonometric variance-preserving diffusion path as the harmonic-oscillator case. More general Lagrangians give rise to new probability paths and velocity fields, and numerical experiments show that they induce meaningful changes in the learned dynamics while remaining competitive with existing conditional flow matching models.
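
The two special cases named in this abstract can be checked with a short single-particle derivation. The sketch below uses the Euler-Lagrange equation for one trajectory with fixed endpoints $x_0$ at $t=0$ and $x_1$ at $t=1$. The frequency $\omega = \pi/2$ and the sign convention of the potential term are assumptions chosen so that the result matches the trigonometric path; the paper may parametrize these differently.

```latex
% Kinetic Lagrangian (free particle): straight-line path
\[
L(x,\dot{x}) = \tfrac{1}{2}\lVert \dot{x} \rVert^{2}
\;\Longrightarrow\; \ddot{x} = 0
\;\Longrightarrow\; x_t = (1-t)\,x_0 + t\,x_1 .
\]

% Harmonic oscillator with \omega = \pi/2 (assumed): trigonometric path
\[
L(x,\dot{x}) = \tfrac{1}{2}\lVert \dot{x} \rVert^{2} - \tfrac{\omega^{2}}{2}\lVert x \rVert^{2}
\;\Longrightarrow\; \ddot{x} = -\omega^{2} x
\;\Longrightarrow\; x_t = \cos\!\big(\tfrac{\pi}{2}t\big)\,x_0 + \sin\!\big(\tfrac{\pi}{2}t\big)\,x_1 .
\]
```

In the second case the general solution is $x_t = A\cos(\omega t) + B\sin(\omega t)$; the endpoint conditions give $A = x_0$ and, because $\cos(\pi/2) = 0$ and $\sin(\pi/2) = 1$, $B = x_1$.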

2605.12770 2026-05-21 cs.LG cs.AI cs.CL

WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

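AI summary This paper proposes WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state, tests it by replacing the model's original write in the recurrent cache with SAE atoms, and evaluates it on several models.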

Comments 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae

Abstract

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

2605.06395 2026-05-21 cs.LG cs.AI eess.SP

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

Kartik Tandon, Julian Gould, Tanishq Bhatia, Francesca Dominici, Alejandro Ribeiro, Claudio Battiloro

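AI summary This paper proposes a convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. It uses the connection Laplacian associated with a Hilbert bundle as the convolutional operator, introduces filters and neural networks called HilbNets, and makes them implementable through a two-stage sampling procedure. It proves that the sheaf Laplacian of the sampling-induced Hilbert cellular sheaf converges to the underlying connection Laplacian, generalizing the Belkin and Niyogi convergence result to the infinite-dimensional bundle setting, and validates the framework on synthetic and real-world tasks.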

Comments 51 pages, 3 figures, 5 tables

Abstract

Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed HilbNets. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin & Niyogi convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.

2605.03601 2026-05-21 cs.LG cs.DM math.CO

Most ReLU Networks Admit Identifiable Parameters

Moritz Grillo, Guido Montúfar

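AI summary This paper studies the realization map of deep ReLU networks and asks when a function determines its parameters up to scaling and permutation. It introduces a framework based on weighted polyhedral complexes, proves that architectures whose input and hidden layers have width at least two admit an open set of identifiable parameters, shows that the functional dimension equals the number of parameters minus the number of hidden neurons, and establishes a generic depth hierarchy.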

Abstract

We study the realization map of deep ReLU networks, focusing on when a function determines its parameters up to scaling and permutation. To analyze hidden redundancies beyond these standard symmetries, we introduce a framework based on weighted polyhedral complexes. Our main result shows that for every architecture whose input and hidden layers have width at least two, there exists an open set of identifiable parameters. This implies that the functional dimension of every such architecture is exactly the number of parameters minus the number of hidden neurons. We further show that minimal functional representations can still have non-trivial parameter redundancies. Finally, we establish a generic depth hierarchy, whereby for an open set of parameters the realized function cannot be represented generically by any shallower network.
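
The scaling symmetry mentioned in this abstract is easy to verify numerically. The sketch below, for a one-hidden-layer network with hypothetical helper names `forward` and `rescale_neuron`, multiplies one hidden neuron's incoming weights and bias by a positive constant and divides its outgoing weight by the same constant; since ReLU satisfies relu(c·z) = c·relu(z) for c > 0, the realized function is unchanged. It illustrates only the standard symmetry, not the paper's polyhedral-complex framework.

```python
def relu(x):
    return max(0.0, x)


def forward(params, x):
    # One hidden layer: params = (W1, b1, w2, b2); each row of W1 is a hidden neuron.
    W1, b1, w2, b2 = params
    hidden = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(v * h for v, h in zip(w2, hidden)) + b2


def rescale_neuron(params, k, c):
    # Scale neuron k's incoming weights and bias by c > 0 and its outgoing
    # weight by 1/c. This changes the parameters but not the function.
    W1, b1, w2, b2 = params
    W1 = [[c * w for w in row] if i == k else list(row) for i, row in enumerate(W1)]
    b1 = [c * b if i == k else b for i, b in enumerate(b1)]
    w2 = [v / c if i == k else v for i, v in enumerate(w2)]
    return W1, b1, w2, b2
```

Each hidden neuron contributes one such free scaling factor, which is consistent with the abstract's count of functional dimension as the number of parameters minus the number of hidden neurons.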

2605.03562 2026-05-21 cs.LG cs.AI

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

Jorge L. Ruiz Williams

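AI summary This paper proposes HeadQ, a key-side method for KV-cache quantization that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction, addressing model-visible distortion. Score-space error predicts attention KL divergence better than raw key MSE.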

Comments Withdrawn by the author because ethical concerns were identified after posting

Abstract

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an $A^2$-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In K-only WikiText-103 decode experiments with dense values, HeadQ removes roughly $84$--$94\%$ of the excess perplexity on the strongest 2-bit rows; in an auxiliary full-KV 2-bit composition, HeadQ plus an $A^2$ value policy improves all six models.
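
The phrase "score error modulo constant shifts" rests on a standard fact: softmax is unchanged when the same constant is added to every logit. The sketch below is a minimal illustration of that fact in plain Python. The function names and numbers are assumptions for illustration, and this is not the HeadQ algorithm itself. It measures score error only after removing the constant-shift component.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shift_invariant_error(true_scores, approx_scores):
    """Squared score error after removing the mean difference, i.e. the
    component of the error that can change the softmax output."""
    diffs = [a - t for t, a in zip(true_scores, approx_scores)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs)

true_scores = [2.0, 0.5, -1.0]
shifted = [s + 3.0 for s in true_scores]  # large raw error, invisible to attention
perturbed = [2.0, 0.9, -1.0]              # small raw error, visible to attention
```

Here `shifted` has a much larger raw squared error than `perturbed`, yet only `perturbed` alters the attention weights, which is the kind of mismatch between storage-space error and model-visible error that the abstract describes.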

2604.24957 2026-05-21 cs.LG cs.AI

Compute Aligned Training: Optimizing for Test Time Inference

计算对齐训练:优化测试时间推断

Adam Ousherovitch, Ambuj Tewari

AI总结 本文提出计算对齐训练方法,通过将训练目标与测试时间策略对齐,提升大语言模型在测试时的推断性能。

详情
AI中文摘要

扩展测试时计算已成为提升大型语言模型(LLM)性能的有力机制。然而,标准的后训练范式,即监督微调(SFT)和强化学习(RL),优化的是基础策略下单个样本的似然,这与依赖聚合或过滤输出的测试时流程并不一致。在本文中,我们提出计算对齐训练,使训练目标与测试时策略保持一致。通过将推理策略视为作用于基础策略的算子,我们推导出新的损失函数,使应用这些策略时的性能最大化。我们针对常见的测试时策略,为SFT和RL实例化了此类损失函数。最后,我们提供实证证据,表明这种训练方法在测试时扩展方面显著优于标准训练。

英文摘要

Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.
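
To make "inference strategies as operators on the base policy" concrete, consider one common strategy, best-of-n sampling with a perfect verifier. It maps a per-sample success probability p to 1 - (1 - p)^n. The sketch below is only an illustration of that operator and its derivative; it is not the loss derived in the paper, and the function names are assumptions.

```python
def pass_at_n(p, n):
    """Probability that at least one of n independent samples is correct,
    given per-sample success probability p."""
    return 1.0 - (1.0 - p) ** n

def pass_at_n_grad(p, n):
    """Derivative of pass_at_n with respect to p: n * (1 - p)^(n - 1)."""
    return n * (1.0 - p) ** (n - 1)
```

With n = 8 the derivative is large when p is small and nearly zero when p is close to 1, so an objective built on this operator would concentrate training signal on prompts the base policy rarely solves, unlike a per-sample likelihood objective.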

2604.15038 2026-05-21 cs.LG cs.AI cs.CV

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

当公平性指标产生分歧:评估机器学习中人口公平性评估的可靠性

Khalid Adnan Alsayed

AI总结 本文研究了公平性评估的一致性问题,通过多指标分析评估机器学习模型中的人口偏见,发现不同公平性指标可能导致矛盾的评估结果,引入了公平性分歧指数(FDI)来量化指标间的不一致程度。

Comments 15 pages, 4 figures, 5 tables

详情
AI中文摘要

在高风险应用中,机器学习系统的公平性评估已成为核心问题,包括生物识别、医疗决策和自动风险评估。现有方法通常依赖少量公平性指标来评估模型行为,隐含假设这些指标能提供一致和可靠的结论。然而,不同公平性指标捕捉模型性能的不同统计属性,可能在相同系统上产生冲突的评估。本文通过系统性的多指标分析,评估机器学习模型中的人口偏见,使用面部识别作为受控实验环境,评估模型在多个群体分区下的性能,包括误差率差异和基于性能的指标。结果表明,公平性评估可能因指标选择而显著变化,导致关于模型偏见的矛盾结论。为量化此现象,我们引入公平性分歧指数(FDI),以捕捉公平性指标间的不一致程度。进一步表明,分歧在阈值和模型配置下仍保持高位。这些发现突显了当前公平性评估实践的关键限制,并表明单一指标报告不足以可靠地评估偏见。

英文摘要

The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.

2604.10784 2026-05-21 cs.AI

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

TorchUMM: 一个用于评估、分析和后训练的统一多模态模型代码库

Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang

AI总结 本文提出TorchUMM,一个统一的多模态模型代码库,用于评估、分析和后训练,涵盖多种多模态模型骨干、任务和数据集,通过统一接口和标准化评估协议,促进异构模型的公平比较和深入理解,推动更强大的统一多模态系统的发展。

Comments Technical Report

详情
AI中文摘要

近年来,统一多模态模型(UMM)的进展催生了大量能够跨视觉和文本模态进行理解、生成和编辑的架构。然而,由于模型架构多样,训练范式和实现细节各不相同,为UMM开发统一框架仍然具有挑战性。在本文中,我们提出了TorchUMM,这是首个统一的代码库,可在多种UMM骨干网络、任务和数据集上进行全面的评估、分析和后训练。TorchUMM支持范围广泛的模型,涵盖多种规模和设计范式。我们的基准涵盖三个核心任务维度:多模态理解、生成和编辑,并整合了已有数据集和新数据集,以评估感知、推理、组合性和指令遵循能力。通过提供统一的接口和标准化的评估协议,TorchUMM使异构模型之间的公平、可复现比较成为可能,并加深对其优势与局限的理解,从而推动更强大的统一多模态系统的发展。代码地址:https://github.com/AIFrontierLab/TorchUMM

英文摘要

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.

2604.09174 2026-05-21 cs.CL

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

面向证据不确定性和幻觉的面级追踪(RAG)

Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl

AI总结 本文提出了一种面向问答任务的面级诊断框架,通过分解每个输入问题为原子推理面,评估证据充分性和基础性,揭示RAG系统中幻觉产生的根源在于生成过程中证据整合的方式,而非检索准确性。

详情
AI中文摘要

检索增强生成(RAG)旨在通过检索证据来减少幻觉,但即使有相关文档可用,幻觉答案仍然常见。现有评估集中在答案级或段落级准确性,对生成过程中证据使用缺乏深入理解。本文引入了一种面向问答任务的面级诊断框架,将每个输入问题分解为原子推理面。对于每个面,我们使用结构化的Facet x Chunk矩阵评估证据充分性和基础性,该矩阵结合检索相关性和基于自然语言推理的忠实度评分。为了诊断证据使用,我们分析了三种受控推理模式:严格RAG,强制依赖检索证据;软RAG,允许整合检索证据和参数知识;以及无检索的LLM生成。比较这些模式能够深入分析检索生成不一致,即相关证据被检索但未在生成过程中正确整合的情况。在医疗问答和HotpotQA上,我们评估了三种开源和闭源LLM(GPT、Gemini和LLaMA),提供了可解释的诊断,揭示了反复出现的面级失败模式,包括证据缺失、证据不一致和先验驱动覆盖。我们的结果表明,RAG系统中的幻觉更多由生成过程中证据整合的方式决定,而非检索准确性,面级分析揭示了系统性的证据覆盖和不一致模式,这些模式在答案级评估下是隐藏的。

英文摘要

Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

2604.01449 2026-05-21 cs.AI cs.LG

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

当AI出错时:AI辅助用药决策系统中的可靠性与风险

Khalid Adnan Alsayed

AI总结 本文研究了AI辅助用药系统在现实决策中的可靠性问题,通过模拟药物相互作用和剂量决策场景,分析系统故障类型及其潜在临床影响,强调在安全关键领域如药房实践中,需补充传统性能指标的风险意识评估方法。

Comments 9 pages, 1 figure. Position paper with simulated experimental analysis of AI reliability in medication decision systems. Minor Correction to Title Metadata (Typo Fix)

详情
AI中文摘要

人工智能(AI)系统日益被整合到医疗和药房工作中,支持药物推荐、剂量确定和药物相互作用检测等任务。尽管这些系统在标准评估指标下通常表现良好,但其在现实决策中的可靠性仍不够理解。在高风险领域如用药管理中,单个错误推荐可能导致严重患者伤害。本文通过聚焦系统故障及其潜在临床后果,探讨AI辅助用药系统的可靠性。不同于仅通过聚合指标评估性能,本文关注错误发生的方式以及AI系统产生错误输出时的情况。通过一系列受控的模拟场景,分析不同类型的系统故障,包括遗漏相互作用、错误风险标记和不适当的剂量推荐。研究发现,AI在用药相关情境中的错误可能导致不良药物反应、无效治疗或延误护理,尤其是在缺乏充分人类监督的情况下。此外,本文讨论了过度依赖AI推荐的风险以及决策过程透明度有限带来的挑战。本文为医疗领域AI评估提供了以可靠性为核心的视角,强调理解故障行为和现实影响的重要性。它突显了在安全关键领域如药房实践中,需补充传统性能指标的风险意识评估方法的必要性。

英文摘要

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

2603.28675 2026-05-21 cs.CV cs.AI cs.LG

Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

为何聚合准确率不足以评估执法面部识别系统的公平性

Khalid Adnan Alsayed

AI总结 本文探讨了在执法场景中,面部识别系统的聚合准确率作为公平性评估指标的不足,通过分析子群体误差分布,指出聚合指标可能掩盖不同群体间的显著差异,并强调需要更全面的评估框架来确保负责任的AI部署。

Comments 9 pages, 2 tables, 1 figure. Position paper with empirical subgroup analysis highlighting limitations of aggregate accuracy in fairness evaluation

详情
AI中文摘要

面部识别系统正越来越多地部署于执法和安全场景,在这些场景中算法决策可能带来重大的社会后果。尽管报告的准确率很高,但越来越多的证据表明,这类系统在不同人口群体上的表现往往不均衡,导致不成比例的误差率和潜在危害。本文认为,在高风险环境中,聚合准确率不足以作为评估面部识别系统公平性和可靠性的指标。通过分析子群体层面的误差分布,包括假阳性率(FPR)和假阴性率(FNR),本文展示了聚合性能指标如何掩盖不同人口群体之间的关键差异。实证观察表明,总体准确率相近的系统可能呈现出截然不同的公平性特征:即使聚合指标相同,子群体误差率也可能差异显著。本文进一步考察了执法应用中以准确率为中心的评估实践所带来的操作风险,在这类应用中,误分类可能导致错误怀疑或漏检。本文强调了公平性感知的评估方法以及与模型无关的审计策略的重要性,这些方法能够对已部署的现实系统进行评估。研究结果强调,需要不再把准确率作为首要指标,而应采用更全面的评估框架,以确保负责任的AI部署。

英文摘要

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.
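
The core argument, that equal aggregate accuracy can hide different subgroup error rates, can be illustrated with a small sketch. The data below are synthetic and chosen only for illustration; they are not from the paper, and the function name is an assumption.

```python
def rates(records):
    """records: list of (true_label, predicted_label), 1 = match, 0 = non-match."""
    tp = sum(1 for y, p in records if y == 1 and p == 1)
    tn = sum(1 for y, p in records if y == 0 and p == 0)
    fp = sum(1 for y, p in records if y == 0 and p == 1)
    fn = sum(1 for y, p in records if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(records),
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
    }

# Synthetic groups: 50 true matches and 50 true non-matches each.
group_a = [(1, 1)] * 45 + [(1, 0)] * 5 + [(0, 0)] * 45 + [(0, 1)] * 5
group_b = [(1, 1)] * 50 + [(0, 0)] * 40 + [(0, 1)] * 10

report = {"A": rates(group_a), "B": rates(group_b)}
```

Both groups reach 90% accuracy, but group B's false positive rate is twice group A's (20% versus 10%), which in a law enforcement setting would mean twice the rate of wrongful matches.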

2603.15842 2026-05-21 cs.LG cs.AI cs.IT math.IT

Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning

信息压缩匿名化:非降级的敏感输入保护用于隐私保护的监督机器学习

Jeremy J Samuelson

AI总结 本文提出了一种信息压缩匿名化(ICA)方法和VEIL架构,通过架构和数学设计而非噪声注入或密码学来实现强隐私保障,确保在隐私保护监督机器学习中保留预测效用,同时支持可扩展的多地区部署。

Comments 47 pages, 29 figures

详情
AI中文摘要

现代机器学习系统越来越依赖敏感数据,由此带来显著的隐私、安全和监管风险,而现有的隐私保护机器学习(ppML)技术,如差分隐私(DP)和同态加密(HE),只能以性能下降、复杂度增加或过高的计算开销为代价来应对这些风险。本文介绍了信息压缩匿名化(ICA)和VEIL架构,这是一种隐私保护的机器学习框架,通过架构和数学设计而非噪声注入或密码学来实现强隐私保障。ICA在受信任的源环境中嵌入一个有监督的多目标编码器,将原始输入转换为低维、与任务对齐的潜在表示,确保只有经过不可逆匿名化的向量被导出到不可信的训练和推理环境。本文利用拓扑和信息论论证严格证明这些编码在结构上不可逆,表明即使在理想化的攻击者假设下,逆向还原在逻辑上也是不可能的,并且在实际部署中,攻击者关于原始数据的条件熵发散,使重建概率趋于零。与以往基于自编码器的ppML方法不同,ICA通过使表示学习与下游监督目标对齐来保留预测效用,从而无需梯度裁剪、噪声预算或推理时加密即可实现低延迟、高性能的机器学习。VEIL架构强制执行严格的信任边界,支持可扩展的多区域部署,并与隐私设计(privacy-by-design)监管框架自然契合,为企业级机器学习奠定了新的基础:即使面对后量子威胁,也能在构造上保证安全、高性能与可靠。

英文摘要

Modern machine learning systems increasingly rely on sensitive data, creating significant privacy, security, and regulatory risks that existing privacy-preserving machine learning (ppML) techniques, such as Differential Privacy (DP) and Homomorphic Encryption (HE), address only at the cost of degraded performance, increased complexity, or prohibitive computational overhead. This paper introduces Informationally Compressive Anonymization (ICA) and the VEIL architecture, a privacy-preserving ML framework that achieves strong privacy guarantees through architectural and mathematical design rather than noise injection or cryptography. ICA embeds a supervised, multi-objective encoder within a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations, ensuring that only irreversibly anonymized vectors are exported to untrusted training and inference environments. The paper rigorously proves that these encodings are structurally non-invertible using topological and information-theoretic arguments, showing that inversion is logically impossible, even under idealized attacker assumptions, and that, in realistic deployments, the attacker conditional entropy over the original data diverges, driving reconstruction probability to zero. Unlike prior autoencoder-based ppML approaches, ICA preserves predictive utility by aligning representation learning with downstream supervised objectives, enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time. The VEIL architecture enforces strict trust boundaries, supports scalable multi-region deployment, and naturally aligns with privacy-by-design regulatory frameworks, establishing a new foundation for enterprise ML that is secure, performant, and safe by construction, even in the face of post-quantum threats.

2603.14698 2026-05-21 cs.RO

Dual Quaternion Based Contact Modeling for Fast and Smooth Collision Recovery of Quadrotors

基于双四元数的接触建模用于四旋翼快速平滑碰撞恢复

Valentin Gaucher, Wenlong Zhang

AI总结 本文提出了一种基于双四元数的接触模型,用于四旋翼在杂乱环境中实现快速且平滑的碰撞恢复,通过对统一的空间旋量(twist)进行运算保留法向与切向冲量分量之间的耦合,降低了执行延迟、位置误差和峰值动能。

Comments 8 pages, 5 figures, 2 tables

详情
AI中文摘要

在杂乱环境中运行的无人飞行器(UAV)需要高效且准确的碰撞建模,以便在碰撞后保持稳定,然而经典的冲量接触模型将法向分量与切向分量解耦。本文提出了一种直接定义在SE(3)流形上的双四元数冲量重置映射。通过对统一的空间旋量(twist,即合并的线速度与角速度)进行运算,所提出的公式在单个闭式表达式中保留了法向与切向冲量分量之间的交叉耦合,并将经典的解耦牛顿冲量模型作为特例恢复出来。我们设计了一个将线动量与角动量耦合的恢复控制器,以保证每次碰撞前后动能耗散。硬件在环基准测试显示,与经过优化的基于矩阵的实现相比,执行延迟降低了24%,与位置加四元数(PQ)形式相比降低了20%。在MuJoCo仿真中,对碰撞角度和摩擦系数进行蒙特卡洛扫描的结果显示,与已发表的线性导纳基线相比,位置均方根误差(RMSE)降低了50.8%至75.1%,峰值动能降低了68.7%至85%。

英文摘要

Unmanned aerial vehicles (UAVs) operating in cluttered environments require efficient and accurate impact modeling to maintain stability post collisions, however classical impulse contact models decouple the normal and tangential components. This letter presents a dual quaternion impulse reset map directly on the SE(3) manifold. By operating on the unified spatial twist (unified linear and angular velocities), the proposed formulation retains the cross-coupling between normal and tangential impulse components in a single closed-form expression, and recovers the classical decoupled Newton impulse model as a special case. A recovery controller is designed that couples linear and angular momentum to enforce kinetic energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24\% reduction in execution latency compared to an optimized matrix-based implementation, and a 20\% reduction relative to a position-plus-quaternion (PQ) formulation. MuJoCo simulations across Monte Carlo sweeps over impact angles and friction coefficients show a 50.8\%-75.1\% reduction in position root-mean-square error (RMSE) and a 68.7\%-85\% decrease in peak kinetic energy compared to published linear-admittance baselines.

2603.00086 2026-05-21 cs.CL cs.AI cs.SD eess.AS

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

基于LLM的迭代改进用于法语临床访谈转录与说话人分离

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

AI总结 本研究提出一种多轮LLM后处理架构,通过交替进行说话人识别和词识别流程,提高法语医疗对话的转录准确性和说话人归属,通过两个法语临床数据集的消融研究验证了四种设计选择的效果。

详情
AI中文摘要

法语医疗对话的自动语音识别仍然具有挑战性,在自发的临床语音中词错误率通常超过30%。本研究提出一种多轮LLM后处理架构,交替执行说话人识别轮次和词识别轮次,以提高转录准确性和说话人归属。在两个法语临床数据集(自杀预防电话咨询和清醒神经外科手术的术前会诊)上的消融研究考察了四种设计选择:模型选择、提示策略、轮次顺序和迭代深度。使用Qwen3-Next-80B时,Wilcoxon符号秩检验证实WDER在自杀预防对话上显著降低(p<0.05,n=18),在清醒神经外科会诊上保持稳定(n=10),且没有出现输出失败,计算成本可接受(RTF 0.32),这表明该方法可用于离线临床部署,但仍有待在更大的语料库上验证。

英文摘要

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

2602.10408 2026-05-21 cs.LG cs.CL

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

预归一化变换器中的门控归一化移除与尺度锚定

Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall

AI总结 本文研究了预归一化变换器中归一化层的必要性,提出了一种门控归一化移除方法,通过TaperNorm逐步将归一化操作转为样本无关的线性或仿射映射,并揭示了最终归一化层对预logit表示尺度的锚定作用。

详情
AI中文摘要

归一化层是变换器中的标准组件,但其依赖样本的计算是否在整个训练和推理过程中都必不可少尚不明确。本文为预归一化变换器开发了一种门控归一化移除方法。该方法通过TaperNorm实现:从标准的RMSNorm/LayerNorm出发,逐步过渡到学习得到的、与样本无关的线性或仿射映射。一旦门控降为零,被过渡的层不再计算逐token的统计量,所得映射可以折叠进相邻的线性投影。结果表明,在所测试的预训练和微调设置中,内部归一化可以被逐步移除,验证损失仅有小幅增加。我们的方法有助于揭示最终归一化层的独特作用,即它锚定了预logit表示的尺度。存在这一锚定时,最后一层隐藏状态的径向变化不会直接降低损失;将其移除后,则可以通过增大logit的幅度来降低交叉熵。固定目标尺度损失提供了一种显式的替代锚定,使得在所测试的范围内可以进行完全无归一化的消融实验。最后,在使用KV缓存的自回归解码基准中,逐步移除内部归一化在保留显式缩放操作时可带来最高1.14倍的吞吐量,折叠后最高可达1.18倍。

英文摘要

Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.
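
The tapering and folding steps can be sketched as follows. The abstract does not give the exact gating formula, so the linear blend of the two scale factors, the learned scalar `c`, and all function names are assumptions of this sketch rather than the TaperNorm definition.

```python
import math

def rms(x, eps=1e-6):
    """Root mean square of a vector, with a small epsilon for stability."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def gated_norm(x, gamma, gate, c):
    """gate = 1: standard RMSNorm. gate = 0: sample-independent map gamma * c * x.
    Blending the two scale factors linearly is an assumption of this sketch."""
    scale = gate / rms(x) + (1.0 - gate) * c
    return [g * scale * v for g, v in zip(gamma, x)]

def fold_into_projection(W, gamma, c):
    """At gate = 0 the layer equals diag(gamma * c), so the following
    projection W can absorb it as W @ diag(gamma * c)."""
    return [[w * g * c for w, g in zip(row, gamma)] for row in W]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]
```

Once the gate is zero, `matvec(fold_into_projection(W, gamma, c), x)` matches `matvec(W, gated_norm(x, gamma, 0.0, c))`, so no per-token statistic and no separate scaling step are needed at inference time.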

2602.09723 2026-05-21 cs.CL

AI-Assisted Scientific Assessment: A Case Study on Climate Change

AI辅助的科学评估:气候变化案例研究

Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer, Zeke Hausfather, Özge Kart Tokmak, Reto Knutti, Markus Leippold, Joseph Ludescher, Katharine J. Mach, Sofia Palazzo Corner, Kasra Rafiezadeh Shahi, Johan Rockström, Joeri Rogelj, Boris Sakschewski

AI总结 本文探讨了AI在科学评估中的应用,通过与气候科学领域13名科学家的合作,测试了基于Gemini的AI环境在评估大西洋经向翻转环流(AMOC)稳定性方面的效果,展示了AI在加速科学流程和提升报告质量方面的贡献,同时强调了专家补充和监督的重要性。

详情
AI中文摘要

新兴的AI协同科学家(AI co-scientist)范式专注于可重复验证的任务,智能体在'猜测并检验'的循环中探索搜索空间。这一范式无法推广到无法重复评估、其基本事实由理论与现有证据的共识性综合来确立的问题。我们评估了一个基于Gemini的AI环境,该环境旨在支持协作式科学评估,并被整合到标准的科学工作流程中。我们与13名背景多样的气候科学领域科学家合作,在一个复杂主题上测试了该系统:大西洋经向翻转环流(AMOC)的稳定性。结果表明,AI可以加速科学工作流程。该小组经过104轮修订,用略多于46人时的时间完成了对79篇论文的全面综合。AI的贡献十分显著:大部分AI生成的内容被保留在报告中。AI还有助于保持逻辑一致性和呈现质量。然而,专家的补充对于确保报告可被接受至关重要:报告中由AI生成的内容不到一半。此外,还需要大量的监督,才能将内容扩充并提升到严谨的科学标准。

英文摘要

The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.

2602.08686 2026-05-21 cs.LG cs.AI

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

CompilerKV: 通过离线经验编译实现风险适应性的键值压缩

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

AI总结 本文提出CompilerKV,一种通过离线经验编译实现风险适应性的键值压缩方法,通过离线编译校准语料库中的纠正表,将在线纠正减少到O(1)查找加预算限制,从而在多个模型架构上实现了压缩SOTA,并在不同压力条件下保持最优性能。

详情

英文摘要

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\barρ{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.
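
The cache-ratio figures above can be checked with simple arithmetic. The sketch assumes that "32k" and "128k" denote multiples of 1024 tokens, which is the reading under which 512/32k gives the 1.56% quoted in the abstract. The helper names are illustrative.

```python
def retained_fraction(budget_tokens, context_tokens):
    """Fraction of prefill KV entries kept under a fixed token budget."""
    return budget_tokens / context_tokens

def budget_at_fixed_ratio(ratio, context_tokens):
    """Token budget implied by holding the cache ratio fixed as context grows."""
    return int(ratio * context_tokens)

K = 1024  # assumption: "32k" and "128k" are multiples of 1024 tokens
ratio = retained_fraction(512, 32 * K)          # 0.015625, i.e. about 1.56%
budget_128k = budget_at_fixed_ratio(ratio, 128 * K)
```

Under this reading, holding the 512/32k ratio fixed at a 128k context corresponds to a budget of 2048 retained tokens.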

2602.07832 2026-05-21 cs.LG cs.AI

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

rePIRL: 通过逆强化学习学习PRM以提高LLM推理

Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo

AI总结 本文提出rePIRL框架,通过逆强化学习学习高效的PRM,无需依赖专家策略的强假设,解决了传统方法中熵崩溃等固有限制问题,通过双学习过程和定制技术提升LLM推理性能,并在数学和编程任务数据集上验证了其有效性。

详情
AI中文摘要

过程奖励已被广泛用于深度强化学习以提高训练效率、减少方差并防止奖励黑客。在LLM推理中,现有工作也探索了各种解决方案来学习有效的过程奖励模型(PRM),有或无专家策略的帮助。然而,现有方法要么依赖于对专家策略的强假设(例如要求其奖励函数),要么受到固有限制(例如熵崩溃),导致PRM效果有限或泛化能力差。在本文中,我们引入了rePIRL,一种受逆强化学习启发的框架,能够在对专家策略假设最少的情况下学习有效的PRM。具体来说,我们设计了一种双学习过程,交替更新策略和PRM。我们的学习算法具有定制技术,以解决将传统逆强化学习扩展到LLM的挑战。我们理论证明,所提出的学习框架可以统一在线和离线PRM学习方法,证明rePIRL可以在最少假设下学习PRM。在标准化数学和编程推理数据集上的经验评估展示了rePIRL在现有方法上的有效性。我们进一步展示了训练的PRM在测试时训练、测试时扩展以及为训练困难问题提供早期信号的应用。最后,我们通过详细的消融研究验证了我们的训练配方和关键设计选择。

英文摘要

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

2601.18973 2026-05-21 cs.LG cs.AI cs.SY eess.SY quant-ph

When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

何时适应胜出?量子控制中元学习的缩放定律

Nima Leclerc, Chris Miller, Nicholas Brawand

AI总结 本文研究了元学习在量子控制中的适应性问题,推导了适应增益的缩放定律,表明适应增益随着梯度步数指数饱和,而随任务方差线性增长,为判断适应的必要性提供了量化标准。

Comments 28 pages, 11 figures

详情
AI中文摘要

量子硬件固有地存在设备异质性和环境漂移,迫使实践者在次优的非自适应控制器与代价高昂的逐设备重新校准之间做出选择。我们推导了元学习缩放定律的一个下界,表明适应增益(即任务特定梯度步带来的期望保真度提升)随梯度步数呈指数饱和,并随任务方差线性增长,从而为判断适应是否值得其开销提供了量化标准。在量子门校准上的验证显示,低方差任务的收益可以忽略不计,但在极端分布外条件(训练噪声的10倍)下,两量子比特门的保真度提升超过40%,这对缩短云量子处理器上的逐设备校准时间具有启示意义。在经典线性二次控制上的进一步验证证实,这些定律源于一般的优化几何,而非量子特有的物理。我们还引入了一种少样本预适应协议,仅用N=3至5个探测步即可估计最优适应预算,在各分布外情形下的相对误差为3%至19%。

英文摘要

Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. We further introduce a few-shot pre-adaptation protocol that estimates the optimal adaptation budget from $N{=}3$-5 probe steps within 3-19% relative error across out-of-distribution regimes.
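
The qualitative shape of the scaling law, exponential saturation in the number of gradient steps and linear growth in task variance, can be sketched with one simple functional form. The abstract does not state the exact expression, so the form below and the constants `a` and `beta` are hypothetical and serve only to illustrate the two stated dependencies.

```python
import math

def adaptation_gain(steps, task_variance, a=1.0, beta=0.5):
    """Assumed form a * variance * (1 - exp(-beta * steps)).
    a and beta are hypothetical constants, not values from the paper."""
    return a * task_variance * (1.0 - math.exp(-beta * steps))

def worth_adapting(steps, task_variance, overhead):
    """Toy decision rule: adapt only if the assumed gain exceeds the overhead."""
    return adaptation_gain(steps, task_variance) > overhead
```

Under this form, extra gradient steps yield diminishing returns while the achievable ceiling grows with task variance, which matches the abstract's observation that low-variance tasks see negligible benefit from adaptation.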

2601.05639 2026-05-21 cs.CV cs.LG

Efficient training for compact compression models via sequential distillation

通过序列知识蒸馏实现紧凑压缩模型的高效训练

Caroline Mazini Rodrigues, Nicolas Keriven, Thomas Maugey

AI总结 本文提出了一种通过序列知识蒸馏减少自动编码器压缩网络的方法,通过简化早期优化目标和逐步引入复杂性,提高了轻量级模型的重建质量与统计保真度,适用于资源受限环境。

详情
AI中文摘要

深度学习图像压缩模型在硬件受限的应用中常面临实际限制。尽管这些模型能够实现高质量的重建,但它们通常复杂、重量大且需要大量的训练数据和计算资源。我们提出了一种方法,通过更稳定的知识蒸馏过程显著减少基于自动编码器的压缩网络。其核心思想是高度减少的架构可以从早期训练中的简化优化目标中受益,随后逐步引入复杂性。因此,我们的方法首先通过序列编码器-解码器知识蒸馏阶段为轻量模型提供稳健的初始化,随后通过标准训练并可使用潜在蒸馏进行正则化。我们在两个不同的架构上评估了所得到的轻量级自动编码器在图像压缩任务中的表现。实验表明,与使用原始损失训练的轻量级自动编码器相比,我们的方法在早期epoch中更好地保持了重建质量和统计保真度,使其在资源受限环境中更具实用性。

英文摘要

Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to significantly reduce autoencoder-based compression networks in a more stable Knowledge Distillation process. The intuition is that highly reduced architectures benefit from simplified optimization objectives in early training, with complexity gradually introduced later. Therefore, our approach begins with a sequential encoder--decoder distillation stage that provides a robust initialization for the lightweight model. This is followed by standard training that can be regularized with latent distillation. We evaluate the resulting lightweight autoencoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity in early epochs better than training lightweight autoencoders with the original loss, making it practical for resource-limited environments.