arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.00377 2026-05-29 cs.LG

Improving Full Waveform Inversion in Large Model Era

Yinan Feng, Peng Jin, Yuzhe Guo, Yinpeng Chen, Youzuo Lin

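AI summary: Proposes coordinated scaling of model capacity, data diversity, and training strategy so that a billion-parameter model trained on simple synthetic data generalizes to complex geological structures, reaching state-of-the-art performance on benchmarks such as OpenFWI.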

Abstract

Full Waveform Inversion (FWI) is a highly nonlinear and ill-posed problem that aims to recover subsurface velocity maps from surface-recorded seismic waveform data. Existing data-driven FWI typically uses small models, as available datasets have limited volume, geological diversity, and spatial extent, leading to substantial concerns about overfitting. Although they perform well on synthetic datasets, current methods fail to generalize to more realistic geological structures. In this work, we show that a model trained entirely on simulated and relatively simple data can generalize remarkably well to challenging and unseen geological benchmarks. We provide a working recipe that tames a billion-parameter model for FWI through coordinated scaling across three axes: model capacity, data diversity, and training strategy. Our model achieves state-of-the-art performance on OpenFWI and significantly narrows the generalization gap in data-driven FWI. Across six challenging geophysical benchmarks, including Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP, Sigsbee, and SEAM Phase I, it infers complex structures absent from the training set and delivers significant performance improvements (SSIM from 0.5844 to 0.7669). Overall, our results demonstrate that with an appropriate scaling strategy, large models trained on simple synthetic data can achieve substantial generalization to more complex and realistic geological structures.

2602.22045 2026-05-29 cs.CL

DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu

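AI summary: Builds DLT-Corpus, a large-scale domain corpus of 2.98 billion tokens covering scientific literature, patents, and social media, uses it to analyze technology emergence patterns and market-innovation correlations, and releases resources including the domain-pretrained model LedgerBERT and a sentiment analysis dataset.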

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

Abstract

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate the utility of DLT-Corpus by analyzing patterns of technology emergence and market-innovation correlations. Findings reveal that technologies first appear in our scientific literature subset before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows less tied to short-term sentiment, tracking overall market expansion in a virtuous cycle in which research precedes and enables economic growth that, in turn, funds further innovation. We release DLT-Corpus and companion artifacts: LedgerBERT (+23% over BERT-base on a DLT-specific Named Entity Recognition (NER) task), a sentiment analysis dataset of 23,301 crypto news headlines and descriptions, tools, and code.

2602.19619 2026-05-29 cs.LG

Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models

Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg

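AI summary: To address the conflation of sampling error and denoising error when evaluating discrete diffusion language models, proposes an oracle-based, sampler-centric framework that isolates sampling error using an exact Hidden Markov Model posterior, and finds that few-step discrete diffusion samplers remain distributionally biased even under an oracle denoiser.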

Comments 30 pages, 10 figures

Abstract

Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler-induced error from the sampling dynamics, a problem that does not arise for ARMs whose autoregressive sampling exactly reflects the learned probability model. We introduce a sampler-centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error in a controlled setting. We show that few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition-level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log-likelihood (NLL), generative perplexity (GenPPL), or MAUVE do not imply correct sampling. Code is available at https://luhantang.github.io/dllm_sampler

2602.17200 2026-05-29 cs.CV

GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky

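AI summary: Proposes Geometry-Aware Spherical Sampling (GASS), which orthogonally decomposes prompt-dependent and prompt-independent directions of variation in CLIP embeddings and expands the projection spread along both axes to guide sampling, enhancing generation diversity while preserving image fidelity and semantic alignment.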

Comments ICML 2026 Camera-ready. Code available at https://github.com/L-YeZhu/GASS_T2I

Abstract

Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
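
Illustrative sketch (not the authors' code): the decomposition in the abstract above can be pictured as projecting each image embedding onto the text embedding and keeping the orthogonal residual. The version below assumes plain Python lists as stand-ins for CLIP embeddings; the helper names `decompose` and `projection_spread` are hypothetical, and variance is used here as one possible measure of projection spread.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def decompose(image_emb, text_emb):
    """Split image_emb into a coefficient along text_emb and an orthogonal residual."""
    t = normalize(text_emb)
    coeff = dot(image_emb, t)  # prompt-dependent component
    parallel = [coeff * x for x in t]
    residual = [a - p for a, p in zip(image_emb, parallel)]  # orthogonal to text_emb
    return coeff, residual


def projection_spread(image_embs, direction):
    """Variance of the projections of a batch of embeddings onto one direction."""
    d = normalize(direction)
    proj = [dot(e, d) for e in image_embs]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)
```

A sampler in the spirit of GASS would try to increase `projection_spread` along both the text direction and a chosen orthogonal direction. How the paper identifies that direction and guides sampling is not specified in the abstract.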

2602.14399 2026-05-29 cs.CV

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song

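AI summary: Proposes the multi-turn adaptive prompting attack (MAPA), which alternates text-vision attack actions and iteratively adjusts the attack trajectory across turns, substantially raising the success rate of multi-turn jailbreak attacks on large vision-language models.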

Abstract

Multi-turn jailbreak attacks have proven effective against text-only large language models (LLMs), where malicious content is gradually introduced to bypass safety alignment. However, effectively extending such attacks to large vision-language models (LVLMs) remains underexplored. In this paper, we find that naively incorporating visual inputs can make multi-turn jailbreaks easier to defend against; for example, overly malicious visual content will easily trigger the defense mechanism in safety-aligned LVLMs, resulting in more conservative responses. Based on this finding, we propose multi-turn adaptive prompting attack (MAPA) that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 15-30% on recent benchmarks against LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini. Our code is available at: https://github.com/thomaschoi143/MAPA.

2602.13600 2026-05-29 cs.CV

SAVAA: Mitigating Hallucinations in LVLMs via Step-wise Adaptive Visual Attention Amplification

Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang

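AI summary: Proposes the SAVAA framework, which estimates hallucination risk via Visual Grounding Entropy and adaptively adjusts the visual attention amplification factor, significantly mitigating hallucinations in large vision-language models on multiple benchmarks.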

Abstract

A line of recent training-free methods for mitigating hallucinations in large vision-language models (LVLMs) operates by amplifying attention to visual tokens during autoregressive generation within a single forward pass. We refer to this paradigm as visual attention amplification (VAA). In this paper, we identify a dual failure pattern in existing VAA methods caused by their use of a fixed amplification factor across generation steps: it can be too weak at some steps, leaving hallucinations unresolved, while too strong at others, introducing new hallucinations. Motivated by this finding, we propose Step-wise Adaptive Visual Attention Amplification (SAVAA), a new VAA framework that estimates hallucination risk for each generated token and uses the estimated risk to adaptively amplify visual attention at the next generation step. Specifically, we introduce Visual Grounding Entropy (VGE), a lightweight hallucination-risk estimator that augments predictive entropy with visual grounding, assigning higher risk to tokens that are uncertain, weakly grounded in the image, or both. Guided by VGE, SAVAA uses the estimated risk to calibrate the VAA factor for the next generation step, applying stronger amplification to higher-risk steps and weaker amplification to lower-risk steps. Across LLaVA-NeXT-7B, Qwen3-VL-8B, and InternVL3.5-8B, SAVAA significantly outperforms baseline methods on generative hallucination benchmarks such as CHAIR, SHR and AMBER. Code is available at: https://github.com/JiachengZ01/SAVVA.
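
Illustrative sketch (not the authors' code): the abstract above describes a risk estimate that combines predictive entropy with visual grounding and then sets the amplification factor for the next step. The abstract does not give the exact formulas, so everything below is an assumption: `grounding` is taken to be a number in [0, 1] (for example, attention mass on visual tokens), the risk combines the two terms multiplicatively, and the factor is a linear map onto a hypothetical range `[low, high]`.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a next-token distribution, scaled to [0, 1].

    Assumes probs has at least two entries and sums to 1.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def risk_score(probs, grounding):
    """Hypothetical risk in [0, 1]: high if the token is uncertain,
    weakly grounded in the image, or both."""
    confidence = 1.0 - normalized_entropy(probs)
    return 1.0 - confidence * grounding


def amplification_factor(risk, low=1.0, high=2.0):
    """Map risk linearly to an amplification factor for the next step.

    low and high are placeholder bounds, not values from the paper.
    """
    return low + (high - low) * risk
```

With these assumptions, a confident and well-grounded token leaves the factor at `low`, while an uncertain or poorly grounded token pushes it toward `high`.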

2602.13436 2026-05-29 cs.RO

Force Sensing for Wearable Human-Robot Interfaces via Fluidic Innervation

Noah Rubin, Ava Schraeder, Hrishikesh Sahu, Thomas C. Bulea, Lillian Chin

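AI summary: Measures force through air channels in a 3D-printed silicone pad, obtains a linear response, and validates the approach in isometric torque, dynamic movement, and robotic exoskeleton settings.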

Comments 6 pages, 7 figures, accepted to BioRob 2026

Abstract

Mechanically characterizing the human-machine interface is essential to understanding user behavior and optimizing wearable robot performance. This interface has been challenging to sensorize due to manufacturing complexity and non-linear sensor responses. Here, we measure human limb-device interaction via fluidic innervation, creating a 3D-printed silicone pad with embedded air channels to measure forces. As forces are applied to the pad, the air channels compress, resulting in a pressure change measurable by off-the-shelf pressure transducers. We demonstrate in benchtop testing that pad pressure is highly linearly related to applied force ($R^2 = 0.998$) and, with strategic pad placement, confirm strong linear relationships to isometric knee torque in a clinical dynamometer. We built on these idealized settings to test pad performance in less constrained settings, including cyclic dynamic and stepwise isometric biceps curls. Finally, we integrated the sensor into a lower-extremity robotic exoskeleton and recorded pad pressure during repeated squats with the device unpowered. Pad pressure tracked squat phase and overall task dynamics consistently. Collectively, our preliminary results suggest fluidic innervation is a readily customizable sensing modality with high signal-to-noise ratio and temporal resolution for capturing human-machine interaction. In the long term, this modality may provide an alternative real-time sensing input to control or optimize wearable robotic systems and to capture user function during device use.
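
Illustrative sketch (not the authors' code): the linearity claim in the abstract above corresponds to an ordinary least-squares fit of pad pressure against applied force, summarized by the coefficient of determination $R^2$. The function name `linear_fit` is hypothetical, and any data passed to it would be the reader's own calibration readings; no values from the paper are reproduced here.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus R^2.

    x and y are equal-length sequences with at least two distinct x values
    and a non-constant y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Once fitted, the slope and intercept can be inverted to convert a pressure reading back into an estimated force.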

2602.12642 2026-05-29 cs.CL cs.AI

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

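AI summary: Proposes the PACED-RL framework, which reinterprets the partition function as a per-prompt expected-reward signal and uses it to guide prompt selection and replay during training, improving sample efficiency while preserving generation diversity.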

Abstract

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but often reduce generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training of LLMs.

2602.10765 2026-05-29 cs.LG

Collaborative Threshold Watermarking

Tameem Bakr, Anish Ambreth, Nils Lukas

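AI summary: For model provenance among multiple clients in federated learning, proposes a (t,K)-threshold watermarking protocol that secret-shares the watermark key so that at least t clients must collaborate to verify, and that resists collusion by fewer than t clients.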

Abstract

In federated learning (FL), $K$ clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale to many clients, because per-client watermarks dilute as $K$ grows, or give any individual client the ability to verify and potentially remove the watermark. We introduce $(t,K)$-threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least $t$ clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key $\tau$ so that coalitions of fewer than $t$ clients cannot reconstruct it, and verification can be performed without revealing $\tau$ in the clear. We instantiate our protocol in the white-box setting and evaluate it on image classification tasks with both IID and non-IID partitions, as well as in a language-model fine-tuning setting. Our watermark remains detectable at scale ($K=128$) with minimal accuracy loss and stays above the detection threshold ($z\ge 4$) under attacks including adaptive fine-tuning using up to 20% of the training data. Code is available at https://github.com/tameemalaa/collaborative-threshold-watermark.
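
Illustrative sketch (not the authors' protocol): the $(t,K)$ threshold property in the abstract above is the defining property of threshold secret sharing. The abstract does not say which scheme the paper uses, so the code below swaps in Shamir's secret sharing over a prime field as a stand-in. The names `split_secret` and `reconstruct` and the field size are assumptions, and the key is treated as an integer smaller than `PRIME`. Unlike the paper's protocol, this sketch reconstructs the key in the clear.

```python
import secrets

PRIME = 2**61 - 1  # Mersenne prime used as the field modulus (illustrative choice)


def split_secret(secret, t, k):
    """Return k shares (x, y); any t of them determine the secret."""
    # Random polynomial of degree t - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, k + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, `split_secret(key, t=3, k=5)` gives five clients one share each, and any three of them can call `reconstruct` on their shares to recover `key`; fewer than three shares are consistent with every possible key.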

2602.09499 2026-05-29 cs.LG cs.CR

Computationally Efficient Replicable Learning of Parities and Applications

Moshe Noivirt, Jessica Sorrell, Eliad Tsfadia

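AI summary: Presents the first computationally efficient replicable learning algorithm for parities, shows that replicable learning over general distributions strictly exceeds SQ learning, and reveals a sample-complexity separation from differential privacy.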

Abstract

We study the computational relationship between replicability (Impagliazzo et al. [STOC '22], Ghazi et al. [NeurIPS '21]) and other stability notions. Specifically, we focus on replicable PAC learning and its connections to differential privacy (Dwork et al. [TCC 2006]) and to the statistical query (SQ) model (Kearns [JACM '98]). Statistically, it was known that differentially private learning and replicable learning are equivalent and strictly more powerful than SQ-learning. Yet, computationally, all previously known efficient (i.e., polynomial-time) replicable learning algorithms were confined to SQ-learnable tasks or restricted distributions, in contrast to differentially private learning. Our main contribution is the first computationally efficient replicable algorithm for realizable learning of parities over arbitrary distributions, a task that is known to be hard in the SQ-model, but possible under differential privacy. This result provides the first evidence that efficient replicable learning over general distributions strictly extends efficient SQ-learning, and is closer in power to efficient differentially private learning, despite computational separations between replicability and privacy. Additionally, we leverage our parity learner to prove that, assuming $RP \neq NP$, converting replicability to pure differential privacy requires a strict loss in sample complexity. Our main building block is a new, efficient, and replicable algorithm that, given a set of vectors, outputs a subspace of their linear span that covers most of them.
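
Illustrative sketch (not the paper's algorithm): for context on the abstract above, the textbook way to learn a parity in the realizable setting is Gaussian elimination over GF(2) on the labeled examples. That learner is efficient but not replicable, since its output can change with the sample; the paper's contribution is a replicable alternative whose details the abstract does not give. The function name `learn_parity` and the list-of-bits encoding are assumptions.

```python
def learn_parity(examples, n):
    """Find s in {0,1}^n with <x, s> mod 2 == y for every example (x, y).

    examples: list of (x, y) where x is a list of n bits and y is 0 or 1.
    Returns one consistent s (free variables set to 0), or None if the
    labels are inconsistent with every parity.
    """
    rows = [list(x) + [y] for x, y in examples]  # augmented matrix over GF(2)
    pivots = []
    r = 0  # index of the next pivot row
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue  # no pivot here: this coordinate is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]  # XOR = add mod 2
        pivots.append(col)
        r += 1
    # Remaining rows are zero on the left; a 1 on the right means 0 = 1.
    if any(row[n] for row in rows[r:]):
        return None
    s = [0] * n
    for i, col in enumerate(pivots):
        s[col] = rows[i][n]
    return s
```

When the examples do not span all of GF(2)^n, several parities fit the sample and this sketch returns only one of them; that sample dependence is the kind of instability a replicable learner has to avoid.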

2602.08013 2026-05-29 cs.AI

Small Agent Group is the Future of Digital Health

Yuqiao Meng, Luoxi Tang, Dazheng Zhang, Rafael Brens, Elvys J. Romero, Nancy Guo, Safa Elkefi, Zhaohan Xi

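AI summary: Proposes that a Small Agent Group (SAG) using collaborative reasoning can replace a single large model, achieving better effectiveness, reliability, and deployment efficiency in digital health.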

Comments ICML'26

Abstract

The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption that clinical intelligence increases with model size and data. However, real-world clinical needs include not only effectiveness, but also reliability and reasonable deployment cost. Since clinical decision-making is inherently collaborative, we challenge the monolithic scaling paradigm and ask whether a Small Agent Group (SAG) can support better clinical reasoning. SAG shifts from single-model intelligence to collective expertise by distributing reasoning, evidence-based analysis, and critical audit through a collaborative deliberation process. To assess the clinical utility of SAG, we conduct extensive evaluations using diverse clinical metrics spanning effectiveness, reliability, and deployment cost. Our results show that SAG achieves superior performance compared to a single giant model, both with and without additional optimization or retrieval-augmented generation. These findings suggest that the synergistic reasoning represented by SAG can substitute for model parameter growth in clinical settings. Overall, SAG offers a scalable solution to digital health that better balances effectiveness, reliability, and deployment efficiency.

2602.07044 2026-05-29 cs.CV cs.AI

PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

Tianyi Qu, Songxiao Yang, Haolin Wang, Huadong Song, Xiaoting Guo, Wenguang Hu, Guanlin Liu, Honghe Chen, Yafei Ou

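AI summary: To address the lack of large-scale public datasets and benchmarks for pipeline magnetic flux leakage inspection, builds the PipeMFL-240K dataset with 249,320 images and 200,020 bounding-box annotations, evaluates existing object detectors, and reveals their shortcomings under challenges such as long-tailed distributions, tiny objects, and intra-class variability.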

Comments Accepted by ACM KDD 2026 Datasets and Benchmarks Track

Abstract

Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 249,320 images and 200,020 high-quality bounding-box annotations, collected from 12 pipelines spanning approximately 1,530 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

2602.06282 2026-05-29 cs.CV q-bio.QM

An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes

Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson, Lotta M. Ellingsen

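AI summary: Presents a vision transformer-based deep learning model that uses fingerprint images to distinguish individuals with Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) from healthy controls, with attention visualizations to enhance interpretability, offering a new tool for noninvasive diagnosis of rare genetic disorders.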

Abstract

Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. Across three binary classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.

2602.05786 2026-05-29 cs.LG stat.AP stat.ML

Selecting Hyperparameters for Tree-Boosting

Floris Jan Koster, Fabio Sigrist

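AI summary: Compares several hyperparameter optimization methods on 59 datasets, finds that the SMAC method clearly outperforms the others, and identifies key factors in hyperparameter tuning.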

Abstract

Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy is critically dependent on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting, including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search, using $59$ regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) a relatively large number of trials (more than $100$) is required for accurate tuning, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than others, and (iv) for regression tasks, choosing the number of boosting iterations using early stopping yields more accurate results than including it in the search space.

2602.05370 2026-05-29 cs.CL

Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang

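AI summary: To address diminishing returns and added noise from high-N sampling in iterative DPO alignment for mathematical reasoning, proposes the PACE framework, which synthesizes preference pairs through low-budget exploration and correction, matching or exceeding high-N baselines with about 1/5 of the compute.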

Abstract

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.

2602.04729 2026-05-29 cs.CL

"Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

“Be My Cheese?”:多语言大语言模型中机器翻译的文化细微差别基准

Madison Van Doren, Casey Ford, Jennifer Barajas, Riley VanMeter, Cory Holland

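AI summary Using a large-scale human evaluation benchmark, this paper studies how well multilingual LLMs handle cultural nuance (idioms, puns, holidays, and cultural concepts) in machine translation, and finds a marked gap between grammatical accuracy and cultural resonance.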

Comments ACL 2026: Natural Language Generation, Evaluation, and Metrics (GEM) Workshop

Details
AI Chinese abstract

我们提出了一个大规模人工评估基准,用于评估最先进的多语言大语言模型(LLMs)在机器翻译中的文化本地化能力。现有的机器翻译基准强调词元和语法准确性,但往往忽略了实际本地化所需的语用和文化能力。基于一项涵盖20种语言87个翻译的试点研究,我们评估了7个多语言LLMs在15个目标语言上的表现,每种语言有5名母语评分员。每位评分员对全文翻译和包含文化细微语言(习语、双关语、节日和文化嵌入概念)的片段级别实例,按0-3的序数质量等级评分;片段评分还包括一个“不适用”选项,用于未翻译的片段。在全文评估中,平均整体质量适中(1.68/3):GPT-5(2.10/3)、Claude Sonnet 4(1.97/3)和Mistral Medium 3.1(1.84/3)构成最强梯队,灾难性失败较少。片段级别结果显示明显的类别效应:节日(2.20/3)和文化概念(2.19/3)的翻译明显优于习语(1.65/3)和双关语(1.45/3),且习语最可能未被翻译。使用Krippendorff's α和Gwet's AC2评估评分者间信度,显示总体一致性中等(Krippendorff's α = 0.45),其中双关语的一致性最低。这些发现表明语法充分性与文化共鸣之间存在持续差距。据我们所知,这是第一个明确关注翻译和本地化中文化细微差别的多语言、人工标注基准。结果凸显了对文化信息训练数据、改进跨语言语用学以及支持系统性文化翻译基准评估框架的需求。

English abstract

We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook the pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Each rater scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 4 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate notably better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. Inter-rater reliability was assessed using Krippendorff's α and Gwet's AC2, indicating moderate agreement overall (Krippendorff's α = 0.45) with the lowest agreement for puns. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation. The results highlight the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation frameworks that support systematic benchmarking of culturally grounded translation.

2602.04516 2026-05-29 cs.RO

TACO: Temporal Consensus Optimization for Continual Neural Mapping

TACO:用于连续神经映射的时间共识优化

Xunlan Zhou, Hongrui Zhao, Negar Mehr

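AI summary Proposes the TACO framework, which casts mapping as a temporal consensus optimization problem and achieves continual neural mapping in dynamic environments without replaying historical data, balancing memory efficiency and adaptability.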

Comments In: Robotics: Science and Systems (RSS 2026)

Details
AI Chinese abstract

神经隐式映射已成为机器人导航和场景理解的有力范式。然而,实际机器人部署需要在严格的内存和计算约束下持续适应变化的环境,现有映射系统无法支持这一点。大多数先前方法依赖于回放历史观测来保持一致性,并假设静态场景。因此,它们无法适应动态机器人环境中的连续学习。为应对这些挑战,我们提出TACO(时间共识优化),一种无回放的连续神经映射框架。我们将映射重新表述为时间共识优化问题,其中将过去的模型快照视为时间邻居。直观上,我们的方法类似于模型咨询其自身的过去知识。我们通过强制与历史表示进行加权共识来更新当前地图。我们的方法允许可靠的历史几何约束优化,同时允许不可靠或过时的区域根据新观测进行修订。TACO在无需存储或回放先前数据的情况下实现了内存效率与适应性之间的平衡。通过广泛的模拟和真实世界实验,我们表明TACO能够稳健地适应场景变化,并持续优于其他连续学习基线。代码可在 https://iconlab.negarmehr.com/TACO 获取。

English abstract

Neural implicit mapping has emerged as a powerful paradigm for robotic navigation and scene understanding. However, real-world robotic deployment requires continual adaptation to changing environments under strict memory and computation constraints, which existing mapping systems fail to support. Most prior methods rely on replaying historical observations to preserve consistency and assume static scenes. As a result, they cannot adapt to continual learning in dynamic robotic settings. To address these challenges, we propose TACO (TemporAl Consensus Optimization), a replay-free framework for continual neural mapping. We reformulate mapping as a temporal consensus optimization problem, where we treat past model snapshots as temporal neighbors. Intuitively, our approach resembles a model consulting its own past knowledge. We update the current map by enforcing weighted consensus with historical representations. Our method allows reliable past geometry to constrain optimization while permitting unreliable or outdated regions to be revised in response to new observations. TACO achieves a balance between memory efficiency and adaptability without storing or replaying previous data. Through extensive simulated and real-world experiments, we show that TACO robustly adapts to scene changes, and consistently outperforms other continual learning baselines. Code is available at https://iconlab.negarmehr.com/TACO

2602.02909 2026-05-29 cs.AI cs.FL cs.LG

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

关于推理的推理:LLM中思维链令牌复杂度的BAPO界限

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, Jennifer Neville

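AI summary By extending the BAPO model, proves that three tasks (binary majority, triplet matching, and graph reachability) require Ω(n) chain-of-thought tokens; experiments confirm linear scaling consistent with the theoretical lower bounds.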

Comments 31 pages; accepted to ICML '26

Details
AI Chinese abstract

通过思维链(CoT)推理进行推理时扩展是当前最先进LLM性能的主要驱动力,但会带来显著的延迟和计算成本。我们解决了一个基本的理论问题:随着输入规模增长,需要多少推理令牌才能解决问题?通过扩展有界注意力前缀预言机(BAPO)模型——一种量化任务所需信息流的LLM抽象——我们证明了三个典型的BAPO困难任务所需的CoT令牌下界:二元多数、三元组匹配和图可达性。我们证明当输入规模为$n$时,每个任务需要$Ω(n)$个推理令牌。我们通过显式构造给出了匹配或接近匹配的上界。最后,我们在前沿推理模型上的实验显示,这些任务上的推理令牌数量近似线性缩放,且在推理预算受限时出现失败,这与我们的理论下界一致。总之,我们的结果识别了通过CoT进行推理时计算的基本瓶颈,并为分析最优推理长度提供了一种原则性工具。

English abstract

Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model, an abstraction of LLMs that quantifies the information flow required to solve a task, we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires $\Omega(n)$ reasoning tokens when the input size is $n$. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.

2602.01058 2026-05-29 cs.LG cs.AI cs.CL

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

好的SFT优化SFT,更好的SFT为强化学习做准备

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

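AI summary To address the mismatch between the offline SFT data distribution and the online RL policy distribution in current SFT-RL pipelines, proposes PEAR, a policy-evaluation-inspired loss re-weighting method for offline learning that reweights the SFT loss via importance sampling to improve subsequent RL training.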

Details
AI Chinese abstract

推理大语言模型的后训练是一个整体过程,通常包括离线SFT阶段和后续的在线强化学习(RL)阶段。然而,SFT通常被孤立地优化,仅追求最大化SFT性能。我们表明,在相同的RL训练后,从更强的SFT检查点初始化的模型可能显著劣于从较弱检查点初始化的模型。我们将此归因于当前SFT-RL流程中典型的错配:生成离线SFT数据的分布可能与在线RL期间优化的策略(该策略从其自身的rollout中学习)存在显著差异。我们提出PEAR(基于策略评估的离线学习损失重加权算法),这是一种在SFT阶段纠正此错配并让模型更好地为RL做准备的方法。PEAR使用重要性采样来重加权SFT损失,具有三种变体,分别在token、块和序列级别操作。它可以用于增强标准SFT目标,并且一旦收集到离线数据的概率,仅需很少的额外训练开销。我们在可验证推理游戏和数学推理任务上对Qwen 2.5和3以及DeepSeek蒸馏模型进行了控制实验。PEAR在标准SFT基础上持续提升了RL后性能,在AIME2025上pass@8增益高达14.6%。我们的结果表明,通过设计和评估SFT时考虑下游RL而非孤立进行,PEAR是迈向更全面的大语言模型后训练的有效一步。

English abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.

2602.00994 2026-05-29 cs.AI

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

推理与工具使用在智能体强化学习中的竞争:从量化干扰到解耦调优

Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang

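AI summary Introduces Capability Effect Attribution (CEA) to quantify interference between reasoning and tool-use behaviors, and proposes the Disentangled Action-Reasoning Tuning (DART) framework, which improves agentic reinforcement learning performance by separating parameter updates.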

Details
AI Chinese abstract

智能体强化学习(ARL)训练大型语言模型将推理与外部工具执行交错进行,以解决复杂任务。大多数现有ARL方法训练一组参数来支持推理和工具使用行为,隐含假设联合训练能提升整体智能体性能。尽管被广泛采用,这一假设很少得到实证检验。本文通过引入能力效应归因(CEA)系统性地检验这一假设,提供了推理与工具使用行为之间干扰的定量证据。通过深入分析,我们表明这两种能力常常导致不一致的梯度方向,产生训练干扰,削弱联合优化的有效性,并挑战了主流的ARL范式。为解决此问题,我们提出解耦动作-推理调优(DART),一个简单高效的框架,通过独立的低秩适应模块显式解耦推理和工具使用的参数更新。仅凭这一简单改变,DART在检索增强问答和NL2SQL的十三个基准上超越了所有联合优化基线,并接近2-智能体上界,进一步支持了我们在共享优化下能力干扰的发现。

English abstract

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single set of parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically examine this assumption by introducing Capability Effect Attribution (CEA), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action-Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool use via separate low-rank adaptation modules. With this simple change alone, DART outperforms all joint-optimization baselines and approaches the 2-Agent upper bound across thirteen benchmarks on retrieval-augmented QA and NL2SQL, further supporting our finding of capability interference under shared optimization.

2601.22661 2026-05-29 cs.SD

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

评估与奖励基于平均延续对数概率的表达性角色扮演TTS的LALM

Yong Ren, Jingbei Li, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang

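AI summary Proposes Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal for improving the stylistic consistency of large audio language models in role-play text-to-speech.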

Comments Accepted by ICML 2026

Details
AI Chinese abstract

大型音频语言模型(LALM)的最新进展已将文本转语音(TTS)扩展到交互式角色扮演场景,这要求高表达力和严格遵守角色扮演指令。然而,现有模型在多轮对话中难以保持与角色档案和场景描述一致的风格。一个关键瓶颈是缺乏量化说话风格的客观指标。为弥补这一差距,我们提出平均延续对数概率(MCLP)作为评估指标和奖励信号,并在基于LALM的角色扮演TTS(RP-TTS)任务上进行了验证。MCLP利用预训练LALM的上下文学习能力,测量真实语音标记在由转录文本、生成语音和重复转录文本组成的上下文历史条件下的似然性,作为风格连续性的代理。此外,我们使用MCLP作为强化学习奖励,以增强生成语音与角色扮演指令之间的风格对齐。为支持该任务,我们构建了一个大规模带有丰富场景和角色标注的RP-TTS数据集。实验表明,MCLP与人类对风格一致性的判断高度一致,并且作为改进RP-TTS的有效奖励,在客观指标和主观评估中均带来一致提升。我们的代码已公开于https://github.com/y-ren16/MCLP。

English abstract

Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. MCLP leverages the in-context learning capability of pretrained LALMs to measure the likelihood of ground-truth speech tokens conditioned on a contextual history consisting of the transcript, generated speech, and repeated transcript, serving as a proxy for stylistic continuity. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and role-play instructions. To support this task, we construct a large-scale RP-TTS dataset with rich scene and character annotations. Experiments demonstrate that MCLP is well aligned with human judgments of stylistic consistency and serves as an effective reward for improving RP-TTS, leading to consistent gains in both objective metrics and subjective evaluations. Our code is publicly available at https://github.com/y-ren16/MCLP.

2601.22531 2026-05-29 cs.LG cs.AI

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

向理性主义者学习:蒸馏中间可解释原理

Jiayi Dai, Randy Goebel

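AI summary Proposes REKD, which uses knowledge distillation to pass a teacher model's interpretable rationales and predictions to a student model, improving the predictive performance of rationale extraction models built on weaker neural networks.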

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Details
AI Chinese abstract

由于深度神经网络(DNN)的广泛使用,尤其是在高风险领域,DNN的可解释性受到了越来越多的关注。原理提取(RE)的总体思想是通过选择-预测架构为DNN提供一个可解释的设计框架,其中两个神经网络分别联合学习进行特征选择和预测。仅依赖于最终任务预测的远程监督,学习选择特征子集(或原理)的过程需要在所有可能的特征组合空间中进行搜索,这在计算上具有挑战性,当基础神经网络能力不足时甚至更加困难。为了提高基于能力较弱或较小神经网络(即学生)的RE模型的预测性能,我们提出了REKD(基于知识蒸馏的原理提取),其中学生RE模型除了自身的RE优化外,还从教师(即理性主义者)的原理和预测中学习。这种对RE的结构调整与人类如何从可解释和可验证的知识中有效学习的方式高度一致。由于该方法与神经模型无关,任何黑盒神经网络都可以作为骨干模型集成。为了证明REKD的可行性,我们使用BERT和视觉变换器(ViT)模型的多种变体进行了实验。我们在语言和视觉分类数据集(即IMDB电影评论、CIFAR 10和CIFAR 100)上的实验表明,REKD显著提高了学生RE模型的预测性能。

English abstract

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or rationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose REKD (Rationale Extraction with Knowledge Distillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a rationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

2601.22347 2026-05-29 cs.LG cs.AI

Pushing the Limits of Block Rotations in Post-Training Quantization

推动后训练量化中块旋转的极限

Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser

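AI summary Proposes the PeRQ framework, which redistributes activations through permutation and rotation to overcome the geometric limits of block rotations in suppressing outliers, significantly improving post-training quantization accuracy.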

Details
AI Chinese abstract

最近的后训练量化(PTQ)方法采用块旋转来在舍入前扩散异常值。虽然这减少了在线全向量旋转的开销,但块结构对异常值抑制的影响仍知之甚少。为填补这一空白,我们首次对块Hadamard旋转的异常值抑制进行了系统的非渐近分析。我们的分析表明,异常值抑制从根本上受限于输入向量的几何结构。特别地,在确定性最坏情况下,当旋转前的ℓ1范数质量在块间均匀分布时,旋转后的异常值最小。受这些见解的启发,我们引入了PeRQ(置换、旋转、然后量化),一个在旋转前通过置换重新分布激活质量的PTQ框架。我们提出了一种贪婪质量扩散算法,通过均衡期望的块间ℓ1范数来校准置换。为避免增加推理开销,我们识别了Transformer架构中置换等变区域,在部署前将这些置换合并到模型权重中。实验表明,PeRQ在所有块大小上一致地提高了精度,在将Llama3 1B量化为INT4且块大小为16时,恢复了全向量旋转困惑度的90%,而未经置换时仅为46%。

English abstract

Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of online full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, in the deterministic worst case, post-rotation outliers are minimized when the pre-rotation $\ell_1$ norm mass is evenly distributed across blocks. Guided by these insights, we introduce PeRQ (Permute, Rotate, then Quantize), a PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise $\ell_1$ norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge these permutations into model weights before deployment. Experiments show that PeRQ consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.
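The role of blockwise $\ell_1$ mass described above can be illustrated with a small sketch. It relies only on an elementary fact: every entry of an orthonormal Hadamard matrix of size $b$ is $\pm 1/\sqrt{b}$, so each rotated entry of a block $x$ is bounded by $\|x\|_1/\sqrt{b}$. The function names are hypothetical, and `greedy_balance` is a simple largest-first heuristic chosen for illustration; it is an assumption, not the calibration algorithm from the paper.

```python
import math

def hadamard_rotate(block):
    """Orthonormal Walsh-Hadamard transform; len(block) must be a power of two."""
    x = list(block)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def block_rotate(vec, block_size):
    """Rotate each consecutive block of vec independently."""
    out = []
    for s in range(0, len(vec), block_size):
        out.extend(hadamard_rotate(vec[s:s + block_size]))
    return out

def block_l1(vec, block_size):
    """l1 mass of each consecutive block."""
    return [sum(abs(v) for v in vec[s:s + block_size])
            for s in range(0, len(vec), block_size)]

def greedy_balance(expected_abs, block_size):
    """Hypothetical heuristic (not the paper's algorithm): visit channels from
    largest to smallest expected magnitude and place each one in the non-full
    block with the least accumulated mass. Returns perm with new[i] = old[perm[i]].
    Assumes len(expected_abs) is a multiple of block_size."""
    n = len(expected_abs)
    n_blocks = n // block_size
    blocks = [[] for _ in range(n_blocks)]
    mass = [0.0] * n_blocks
    for idx in sorted(range(n), key=lambda i: -expected_abs[i]):
        open_blocks = [b for b in range(n_blocks) if len(blocks[b]) < block_size]
        target = min(open_blocks, key=lambda b: mass[b])
        blocks[target].append(idx)
        mass[target] += expected_abs[idx]
    return [i for blk in blocks for i in blk]

if __name__ == "__main__":
    vec = [8.0, 8.0, 8.0, 8.0, 0.1, 0.1, 0.1, 0.1]  # all mass in the first block
    perm = greedy_balance([abs(v) for v in vec], block_size=4)
    permuted = [vec[i] for i in perm]
    print("block l1 before:", block_l1(vec, 4))
    print("block l1 after: ", block_l1(permuted, 4))
    print("max |entry| before:", max(abs(v) for v in block_rotate(vec, 4)))
    print("max |entry| after: ", max(abs(v) for v in block_rotate(permuted, 4)))
```

On this toy vector the permutation spreads the four large channels over both blocks, so the largest rotated entry falls from 16 to roughly 8.1. In a deployed model the permutation would have to be folded into the weights, as the abstract notes; that step is outside this sketch.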

2601.21725 2026-05-29 cs.CL cs.LG

Procedural Pretraining: Warming Up Language Models with Abstract Data

程序化预训练:用抽象数据预热语言模型

Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

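AI summary Proposes procedural pretraining, which pretrains language models on abstract structured data (such as procedural data generated by formal languages) to markedly improve reasoning ability and speed up later acquisition of semantic knowledge; experiments show that as little as 0.1-0.3% procedural data outperforms standard pretraining.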

Comments ICML 2026. Project page: https://zlshinnick.github.io/procedural-pretraining-page/

Details
AI Chinese abstract

直接在网络规模语料库上预训练语言模型是当前的主流范式。我们研究了一种替代方案:首先让模型接触抽象结构化数据,以简化后续丰富语义知识的获取,类似于人类在学习高级推理之前先学习简单逻辑和数学。我们关注由形式语言和其他简单算法生成的程序数据作为此类抽象数据。首先,我们诊断了不同形式的程序数据能够提升的算法技能,通常效果显著。例如,当模型在Dyck序列(平衡括号)上预训练时,上下文召回(大海捞针)的准确率从10%跃升至98%。其次,我们研究了这些增益如何反映在更大模型(高达1.3B参数)的预训练中。我们发现,仅在前端加入0.1%至0.3%的程序数据,就能显著优于在自然语言、代码和非正式数学(C4、CodeParrot和DeepMind-Math数据集)上的标准预训练。值得注意的是,这也使得模型仅需原始数据的55/67/86%即可达到相同的损失值,从而相应地减少FLOPs。第三,我们探索了这些收益背后的机制,发现程序化预训练在注意力层和MLP层中都注入了非平凡的结构。前者对于结构化领域(如代码)尤为重要,后者对于语言领域重要。最后,我们为组合多种形式的程序数据铺平了道路。我们的结果表明,程序化预训练是一种简单、轻量级的方法,能够提升性能并加速语言模型预训练,最终揭示了在LLM中将知识获取与推理分离的前景。

English abstract

Pretraining language models directly on web-scale corpora is the de facto paradigm. We study an alternative where the model is initially exposed to abstract structured data to ease the subsequent acquisition of rich semantic knowledge, much like humans learning simple logic and mathematics before higher reasoning. We focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, the accuracy of context recall (Needle-in-a-haystack) jumps from 10% to 98% when a model is pretrained on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B parameters). We find that front-loading as little as 0.1 to 0.3% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this also enables the models to reach the same loss value with only 55/67/86% of the original data and thus a comparable reduction in FLOPs. Third, we explore the mechanisms behind the benefits and find that procedural pretraining instills non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
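To make "Dyck sequences (balanced brackets)" concrete, the following is a minimal sketch of a generator for such procedural data, together with a checker. The bracket alphabet, the `p_open` parameter, and both function names are assumptions made for illustration; the abstract does not specify how the authors sample their sequences.

```python
import random

# Assumed bracket alphabet: three bracket types.
BRACKETS = {"(": ")", "[": "]", "{": "}"}

def random_dyck(n_pairs, rng, p_open=0.5):
    """Return a random balanced bracket string containing n_pairs pairs."""
    out, stack, opened = [], [], 0
    while len(out) < 2 * n_pairs:
        can_open = opened < n_pairs
        can_close = bool(stack)
        # Open when we must (nothing to close) or when the coin flip says so.
        if can_open and (not can_close or rng.random() < p_open):
            opener = rng.choice(list(BRACKETS))
            stack.append(opener)
            out.append(opener)
            opened += 1
        else:
            # Close the most recently opened bracket to keep nesting valid.
            out.append(BRACKETS[stack.pop()])
    return "".join(out)

def is_balanced(s):
    """Check that s is a well-nested string over the bracket alphabet."""
    closers = {close: opener for opener, close in BRACKETS.items()}
    stack = []
    for ch in s:
        if ch in BRACKETS:
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
        else:
            return False
    return not stack

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_dyck(6, rng))
```

Passing an explicit `random.Random` instance keeps the generated corpus reproducible, which matters if such data is used as a fixed pretraining prefix.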

2601.20255 2026-05-29 cs.LG cs.CL cs.SE

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

HE-SNR:通过熵揭示潜在逻辑以指导SWE-bench上的中期训练

Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu

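AI summary For the SWE-bench benchmark, proposes the HE-SNR metric based on the Entropy Compression Hypothesis, using fine-grained entropy analysis to guide mid-training data selection, validated on models with up to 560B parameters.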

Comments Accepted at ICML 2026. 21 pages, 15 figures

Details
AI Chinese abstract

SWE-bench已成为评估大型语言模型在复杂软件工程任务中能力的主要基准。虽然这些能力主要在中期训练阶段获得,并在监督微调(SFT)期间被激发,但目前仍然缺乏能够有效指导中期训练的指标。诸如困惑度(PPL)等标准指标受到“长上下文税”的影响,且与下游SWE性能的相关性较弱。在本文中,我们首先引入严格的数据过滤策略来弥补这一差距。关键地,我们提出了熵压缩假说,将智能重新定义为不是通过标量Top-1压缩,而是通过将不确定性结构化为低阶的熵压缩状态(“合理犹豫”)的能力。基于这种细粒度熵分析,我们制定了一个新的指标,HE-SNR(高熵信噪比)。我们在不同上下文窗口(32K/128K)下对高达560B参数的模型验证了我们的方法。这项工作为优化LLM在复杂工程领域的潜在能力提供了理论基础和实用工具。

English abstract

SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). We validate our approach on models with up to 560B parameters across different context windows (32K/128K). This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.

2601.18728 2026-05-29 cs.LG math.DG math.OC math.ST stat.TH

Riemannian AmbientFlow: Towards Simultaneous Manifold Learning and Generative Modeling from Corrupted Data

黎曼环境流:面向从损坏数据同时进行流形学习和生成建模

Willem Diepeveen, Oscar Leong

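AI summary Proposes the Riemannian AmbientFlow framework, which uses variational inference and data-driven Riemannian geometry to learn a probabilistic generative model and a nonlinear data manifold simultaneously from corrupted observations, with theoretical guarantees of controllable error and a bi-Lipschitz manifold parametrization.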

Details
AI Chinese abstract

现代生成建模方法在从干净样本学习复杂数据分布方面表现出强大性能。然而,在许多科学和成像应用中,干净样本不可用,只能观测到噪声或线性损坏的测量值。此外,数据中存在的潜在结构(如流形几何)对于进一步的科学分析至关重要。在这项工作中,我们引入了Riemannian AmbientFlow,一个直接从损坏观测中同时学习概率生成模型和底层非线性数据流形的框架。基于AmbientFlow的变分推断框架,我们的方法结合了由归一化流引起的数据驱动黎曼几何,通过拉回度量和黎曼自编码器提取流形结构。我们建立了理论保证,表明在适当的几何正则化和测量条件下,学习到的模型以可控误差恢复底层数据分布,并产生光滑的双Lipschitz流形参数化。我们进一步证明,所得的光滑解码器可以作为具有恢复保证的逆问题的原则性生成先验。我们在低维合成流形和MNIST上实证验证了我们的方法。

English abstract

Modern generative modeling methods have demonstrated strong performance in learning complex data distributions from clean samples. In many scientific and imaging applications, however, clean samples are unavailable, and only noisy or linearly corrupted measurements can be observed. Moreover, latent structures, such as manifold geometries, present in the data are important to extract for further downstream scientific analysis. In this work, we introduce Riemannian AmbientFlow, a framework for simultaneously learning a probabilistic generative model and the underlying, nonlinear data manifold directly from corrupted observations. Building on the variational inference framework of AmbientFlow, our approach incorporates data-driven Riemannian geometry induced by normalizing flows, enabling the extraction of manifold structure through pullback metrics and Riemannian Autoencoders. We establish theoretical guarantees showing that, under appropriate geometric regularization and measurement conditions, the learned model recovers the underlying data distribution up to a controllable error and yields a smooth, bi-Lipschitz manifold parametrization. We further show that the resulting smooth decoder can serve as a principled generative prior for inverse problems with recovery guarantees. We empirically validate our approach on low-dimensional synthetic manifolds and on MNIST.

2601.12699 2026-05-29 cs.LG cs.SY eess.SY

Bandit Algorithms for Deep Brain Stimulation

深度脑刺激的赌博机算法

Arkaprava Gupta, Nicholas Carter, William Zellers, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Parasara Sridhar Duggirala, Samarjit Chakraborty

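AI summary Proposes a time- and threshold-triggered pruned multi-armed bandit algorithm that needs no offline training, outperforms deep reinforcement learning methods in suppressing pathological beta-band activity and reducing stimulation energy, and is shown to be feasible on resource-constrained implantable systems.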

Comments Accepted to the ACM/IEEE 17th International Conference on Cyber-Physical Systems (ICCPS) 2026

Details
AI Chinese abstract

深度脑刺激(DBS)是帕金森病的有效治疗方法,但传统的固定参数刺激会降低电池寿命并引起副作用,同时无法适应变化的神经动力学。最近的强化学习方法提高了适应性,但大多数依赖深度神经网络,需要离线训练且计算成本过高,不适合植入式硬件。本文提出了一种基于时间与阈值触发的剪枝多臂赌博机(T3P MAB)算法的资源意识自适应DBS框架。该方法联合调节刺激频率和幅度,避免预先训练,并且足够透明以支持临床医生指导的调整。使用计算基底节-丘脑模型,我们展示了T3P比竞争的MAB方法收敛更快,并且在抑制病理性β波段活动方面优于深度强化学习基线,同时降低刺激功率。我们在不同的微控制器上实现了该方法,并报告了详细的能量测量,显示在不到两分钟内收敛,适合资源受限的植入式系统。这些结果支持轻量级赌博机控制作为实现个性化、节能DBS的实用途径。

English abstract

Deep Brain Stimulation (DBS) is an effective treatment for Parkinson's disease, but conventional fixed-parameter stimulation can reduce battery life and cause side effects while failing to adapt to changing neural dynamics. Recent reinforcement learning approaches improve adaptability, yet most rely on deep neural networks that require offline training and are computationally too expensive for implantable hardware. This paper presents a resource-conscious adaptive DBS framework based on a Time- and Threshold-Triggered Pruned Multi-Armed Bandit (T3P MAB) algorithm. The proposed method jointly tunes stimulation frequency and amplitude, avoids prior training, and remains transparent enough to support clinician-guided adjustment. Using a computational basal ganglia-thalamic model, we show that T3P converges faster than competing MAB methods and outperforms deep-RL baselines in suppressing pathological beta-band activity while reducing stimulation power. We implemented it on different microcontrollers and report detailed energy measurements, showing convergence in under two minutes and suitability for resource-constrained implantable systems. These results support lightweight bandit-based control as a practical path toward personalized, energy-efficient DBS.

2601.10960 2026-05-29 cs.CL cs.AI

Steering Language Models Before They Speak: Logit-Level Interventions

在语言模型发言前引导它们:Logit 级别的干预

Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han

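AI summary Proposes SWAI, which steers language models directly in logit space using corpus-derived token statistics, without training or access to internal activations, and outperforms prompting and baseline methods on readability, politeness, and toxicity control.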

Comments preprint

Details
AI Chinese abstract

可控生成要求语言模型实现诸如阅读水平、礼貌性和毒性等输出特征。现有的引导方法通常间接、需要访问内部激活或依赖辅助训练模型。我们提出 SWAI,一种无需训练、推理时的方法,通过使用基于语料库的 token 统计直接在 logit 空间引导,解决了这些限制。SWAI 从标记语料库计算 z 归一化的一对多 log-odds 分数,并仅在模型的前 K 个候选集内偏向高分数 token,从而在保留上下文合理选择的同时允许控制偏向目标特征 token。在可读性、礼貌性和毒性控制方面,SWAI 在不修改模型参数、访问内部层或训练辅助模型的情况下,始终优于基于提示和先前的 logit 级别基线。选择性和查找表消融实验表明,增益来自目标特定的统计分数,而非通用 logit 扰动。这些结果表明,当 logit 干预在高概率候选下由目标特定统计引导时,有效的引导不需要学习控制器。

English abstract

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existing steering methods are often indirect, require access to internal activations, or depend on auxiliary trained models. We propose SWAI, a training-free inference-time method that addresses these limitations by steering directly in logit space using corpus-derived token statistics. SWAI computes z-normalized one-vs-rest log-odds scores from labeled corpora and biases high-scoring tokens only within the model's top-K candidate set, allowing control to favor target-characteristic tokens while preserving contextually plausible choices. Across readability, politeness, and toxicity control, SWAI consistently improves over prompt-based and prior logit-level baselines without modifying model parameters, accessing internal layers, or training an auxiliary model. Selectivity and lookup-table ablations show that the gains come from target-specific statistical scores rather than generic logit perturbation. These results indicate that effective steering does not require learned controllers when the logit intervention is guided by target-specific statistics under high-probability candidates.
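The two steps named in the abstract, z-normalized one-vs-rest log-odds scores and a bias restricted to the top-K candidates, can be sketched roughly as follows. This is an illustrative reading, not SWAI's actual implementation: the additive smoothing `alpha`, the `strength` multiplier, the choice to boost only positively scored tokens, and the dictionary-based logits are all assumptions.

```python
import math
from collections import Counter

def log_odds_scores(target_tokens, rest_tokens, alpha=1.0):
    """Smoothed one-vs-rest log-odds per token, z-normalized over the vocabulary.
    Assumes the combined vocabulary has at least two distinct tokens."""
    target, rest = Counter(target_tokens), Counter(rest_tokens)
    vocab = sorted(set(target) | set(rest))
    n_t, n_r, v = sum(target.values()), sum(rest.values()), len(vocab)
    raw = {}
    for tok in vocab:
        p_t = (target[tok] + alpha) / (n_t + alpha * v)
        p_r = (rest[tok] + alpha) / (n_r + alpha * v)
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))
    mean = sum(raw.values()) / v
    std = math.sqrt(sum((s - mean) ** 2 for s in raw.values()) / v) or 1.0
    return {tok: (s - mean) / std for tok, s in raw.items()}

def steer_logits(logits, scores, top_k=3, strength=2.0):
    """Boost positively scored tokens, but only among the top_k highest logits."""
    top = sorted(logits, key=logits.get, reverse=True)[:top_k]
    steered = dict(logits)
    for tok in top:
        score = scores.get(tok, 0.0)
        if score > 0:  # assumption: only target-characteristic tokens are boosted
            steered[tok] += strength * score
    return steered

if __name__ == "__main__":
    polite = "please thank you please kindly".split()
    other = "now do it now you".split()
    scores = log_odds_scores(polite, other)
    logits = {"please": 1.0, "now": 1.2, "do": 0.9, "kindly": -5.0}
    print(steer_logits(logits, scores))
```

Restricting the bias to the top-K set is what keeps an implausible token such as `kindly` (logit -5.0 in the toy example) from being promoted merely because it is typical of the target corpus.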

2601.08654 2026-05-29 cs.CL cs.AI cs.LG

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

从评分标准到可靠分数:基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

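AI summary Proposes the Rulers framework, whose three-stage inference (task specification, structured execution, post-hoc calibration) addresses execution drift, unverifiable attribution, and human-scale misalignment in LLM rubric-based text evaluation, yielding more reliable scores.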

Details
AI Chinese abstract

基于评分标准的文本评估越来越多地使用大型语言模型(LLM)作为可扩展的裁判,但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题:目标不仅仅是提示LLM分配分数,而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式:评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式,我们引入了Rulers,一个三阶段推理时框架,用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范,然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范,最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中,Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明,Rulers更好地匹配了经验人类分数分布,提高了在语义等价评分标准扰动下的稳定性,并受益于其三个组成部分。这些结果表明,可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释,而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

English abstract

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.
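One ingredient mentioned above, extractive quote verification, lends itself to a very small sketch: a judge's quoted evidence only counts if it actually occurs in the evaluated text. The function name and the normalization rule (case-insensitive, whitespace-collapsed matching) are assumptions for illustration and are not taken from the Rulers code.

```python
import re

def _normalize(text):
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(source_text, quotes):
    """Map each quote to True if it occurs in source_text after normalization.
    Empty quotes are treated as unverified."""
    haystack = _normalize(source_text)
    results = {}
    for quote in quotes:
        needle = _normalize(quote)
        results[quote] = bool(needle) and needle in haystack
    return results

if __name__ == "__main__":
    essay = "The author argues that  recycling\nreduces waste."
    print(verify_quotes(essay, ["recycling reduces waste",
                                "recycling increases waste"]))
```

A score justification whose quotes fail this check could be flagged or discarded, which is one simple way to make score attribution auditable.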

2601.08064 2026-05-29 cs.CL

Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

校准还不够:评估语言变化下的置信度估计

Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

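AI summary Proposes a new evaluation framework based on robustness, stability, and sensitivity, revealing that existing confidence estimation methods fall short in distinguishing semantically different answers.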

Details
AI Chinese abstract

置信度估计(CE)指示大型语言模型答案的可靠性,影响用户信任和决策。现有评估主要关注置信度与正确性之间的一致性,但忽略了语言的可变性:置信度估计应在语义等价的提示或答案变体下保持一致,而在答案含义不同时发生变化,因为这可能表明正确性的变化。因此,我们引入了一个基于三个互补属性的新评估框架:对提示扰动的鲁棒性、跨语义等价答案的稳定性以及对语义不同答案的敏感性。我们表明这些指标与现有CE指标在很大程度上独立,并且常见的CE方法往往在这些指标上失败:虽然大多数方法实现了高鲁棒性和稳定性,但它们难以区分语义不同的答案,可能是因为它们没有有效利用生成侧信息。总体而言,我们的框架揭示了当前CE评估中被忽视的局限性,并为现实应用中选择置信度估计器提供了指导。

English abstract

Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers. We show that these metrics are largely independent from existing CE metrics, and that common CE methods often fail on them: while most methods achieve high robustness and stability, they struggle to distinguish semantically different answers, potentially because they do not effectively leverage generation-side information. Overall, our framework exposes overlooked limitations of current CE evaluations and provides guidance for selecting confidence estimators for real-world applications.