arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26026 2026-05-26 cs.CV cs.AI cs.LG

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk, Zhuhao Wu, Johannes C. Paetzold

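AI summary: Proposes a 3D foundation model trained by jointly optimizing masked reconstruction and image-text alignment. It is pretrained on light sheet fluorescence microscopy data and, through few-shot adaptation, substantially lowers annotation cost while improving segmentation, classification, and deblurring performance.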

Comments
11 pages, 3 figures
Abstract

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git.

2605.26019 2026-05-26 cs.LG cs.AI cs.CL

Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service

Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, Andrea Martínez Freile

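AI summary: Proposes a retrieval-augmented generation framework that combines hybrid dense-sparse retrieval with prompt augmentation to automatically detect and classify potentially abusive clauses in Chilean Terms of Service. It also introduces a corpus of 100 contracts and 10,029 annotated clauses; experiments show that the approach substantially improves performance and brings local models close to cloud-based systems.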

Comments
42 pages, 6 figures, 9 tables
Abstract

Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance. We present a retrieval-augmented generation framework for the automated detection and classification of potentially abusive clauses in Chilean Terms of Service. Designed for local execution, it combines efficient clause detection, hybrid dense-sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. We also introduce the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and enables local models to approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.
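
One common way to combine dense and sparse retrievers is reciprocal rank fusion. The abstract does not say which fusion rule the authors use, so the following is a minimal Python sketch under that assumption; the function name, the constant `k=60`, and the clause IDs are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of clause IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a commonly used default; an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs for one query clause.
dense_hits = ["c3", "c1", "c7"]   # e.g. from an embedding index
sparse_hits = ["c1", "c9", "c3"]  # e.g. from a keyword index
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

With these toy lists, `c1` ranks first because both retrievers place it near the top, which is the behavior a hybrid retriever is meant to reward before reranking.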

2605.26014 2026-05-26 cs.CV cs.CL

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao

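AI summary: Proposes the STORM framework, which internalizes the reasoning process in bounded continuous latent trajectories and requires no explicit textual chain-of-thought or external tools, improving video reasoning accuracy while reducing inference overhead.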

Abstract

Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORM (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORM aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORM performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORM improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.

2605.26013 2026-05-26 cs.LG cs.AI cs.CV

AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai

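AI summary: Proposes the AdvantageFlow algorithm, which combines an advantage-weighted forward-process prediction loss with rollout policy regularization and outperforms Flow-GRPO and a negative-aware fine-tuning baseline on image generation tasks.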

Abstract

We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.
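
A one-parameter toy problem can illustrate why negative advantages destabilize an advantage-weighted least-squares objective. This sketch is not the AdvantageFlow loss: the scalar parameter `theta`, the proximal penalty `beta * (theta - theta_old) ** 2` standing in for a regularizer toward the rollout policy, and all numbers are assumptions made for illustration.

```python
def weighted_ls_loss(theta, targets, advantages, theta_old=0.0, beta=0.0):
    """Advantage-weighted squared error plus an optional proximal penalty."""
    fit = sum(a * (theta - y) ** 2 for a, y in zip(advantages, targets))
    return fit + beta * (theta - theta_old) ** 2


def curvature(advantages, beta=0.0):
    """Second derivative of the loss in theta (constant for a quadratic)."""
    return 2.0 * (sum(advantages) + beta)


targets = [1.0, 2.0, 4.0]
advantages = [0.5, -1.0, -0.2]  # hypothetical, mostly negative advantages

unregularized = curvature(advantages)          # negative: loss is unbounded below
regularized = curvature(advantages, beta=1.0)  # positive: loss is convex again
```

When the advantages sum to a negative value the quadratic opens downward and has no minimizer; a sufficiently large penalty toward the previous parameter restores positive curvature.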

2605.26012 2026-05-26 cs.LG cs.AI

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Aleksandar Todorov, Matthia Sabatelli

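AI summary: Proposes a simple prior that inserts a fixed orthogonal projection into reinforcement learning encoder features to constrain them to a low-dimensional subspace. The authors prove that it preserves expressivity under a linear realizability assumption and show experimentally that value representations can be compressed to extremely low dimensions without loss of performance.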

Abstract

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.
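
The core operation described above, a fixed orthonormal projection applied to encoder features, might look like the following in plain Python. The Gram-Schmidt construction, the seed, and the dimensions are assumptions chosen for illustration; the abstract does not state how the authors build the projection.

```python
import math
import random


def random_orthonormal_basis(dim, k, seed=0):
    """Return k orthonormal vectors in R^dim (requires k <= dim)."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along vectors already in the basis.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # skip near-degenerate draws
            basis.append([x / norm for x in v])
    return basis


def bottleneck(features, basis):
    """Map a feature vector to its k coordinates in the fixed subspace."""
    return [sum(f * b for f, b in zip(features, row)) for row in basis]


basis = random_orthonormal_basis(dim=8, k=3)  # built once, never trained
z = bottleneck([1.0] * 8, basis)              # 3-dimensional code
```

Because the basis is fixed, only the layers before and after the bottleneck receive gradient updates in this sketch.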

2605.26007 2026-05-26 cs.CL

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

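AI summary: Targeting the low-resource Filipino-English code-switching setting, presents the first systematic evaluation of transformer-based dementia detection models and finds that bilingual fine-tuning eliminates cross-lingual degradation, reaching Macro-F1 = 0.969-0.973.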

Comments
Accepted to BioNLP Workshop @ ACL 2026
Abstract

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
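
Macro-F1, the metric reported above, is the unweighted mean of per-class F1 scores. A minimal reference computation in Python, with made-up labels, could be:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: a classifier that almost always predicts "control".
y_true = ["dementia", "dementia", "control", "control", "control", "control"]
y_pred = ["control", "control", "control", "control", "control", "dementia"]
score = macro_f1(y_true, y_pred)
```

Here accuracy is 0.5 but macro-F1 is about 0.33, because the minority class contributes an F1 of zero.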

2605.26004 2026-05-26 cs.CV cs.CL

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

Shristi Das Biswas, Kaushik Roy

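AI summary: Proposes MAGIC, which uses three intrinsic signals from a pretrained VLM (multimodal gain, bridging relevance, and skill-neuron signatures) in a training-free, forward-only coreset selection procedure to build compact, behaviorally faithful subsets for multimodal instruction tuning, matching or exceeding full fine-tuning performance under a 20% budget.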

Abstract

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
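
The three-stage pipeline described above (filter, rank, allocate per bucket) can be sketched as follows. The field names, the gain threshold, and the round-robin allocation rule are assumptions for illustration; the abstract does not specify how the budget is split across signature buckets.

```python
from collections import defaultdict


def select_coreset(samples, budget, min_gain=0.0):
    """samples: dicts with 'id', 'gain', 'quality', and 'signature' keys."""
    # Stage 1: drop samples whose multimodal gain is too low.
    kept = [s for s in samples if s["gain"] > min_gain]
    # Stage 2: rank survivors by quality within each signature bucket.
    buckets = defaultdict(list)
    for s in sorted(kept, key=lambda s: -s["quality"]):
        buckets[s["signature"]].append(s)
    # Stage 3: round-robin over buckets so every signature is represented.
    chosen = []
    while len(chosen) < budget and any(buckets.values()):
        for sig in sorted(buckets):
            if buckets[sig] and len(chosen) < budget:
                chosen.append(buckets[sig].pop(0)["id"])
    return chosen


# Hypothetical scored samples.
samples = [
    {"id": "a", "gain": 0.9, "quality": 0.8, "signature": "ocr"},
    {"id": "b", "gain": 0.7, "quality": 0.9, "signature": "ocr"},
    {"id": "c", "gain": -0.1, "quality": 0.95, "signature": "ocr"},
    {"id": "d", "gain": 0.4, "quality": 0.3, "signature": "count"},
    {"id": "e", "gain": 0.5, "quality": 0.6, "signature": "count"},
]
subset = select_coreset(samples, budget=3)
```

Sample `c` has the highest quality score but is filtered out in the first stage because the image adds nothing to its likelihood, which is the kind of low-visual-dependency example the abstract describes.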

2605.26003 2026-05-26 cs.CV

Towards 3D heart mesh generation using contactless radar imaging and physics-informed neural network

Jinye Li, Chenxi Fu, Minghang Zheng, Yang Liu, Xiahai Zhuang, Qingchao Chen

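AI summary: Proposes the SAR2Mesh framework, which reconstructs high-fidelity 3D cardiac geometry from synthetic aperture radar images through a coarse-to-fine mesh deformation process combined with geometry-aware feature projection and a physics-informed radar loss.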

Abstract

Cardiac function evaluation necessitates continuous, non-invasive monitoring, a capability limited in MRI. Millimeter-wave (mmWave) radar and its Synthetic Aperture Radar (SAR) mode offer a privacy-preserving and portable option for point-of-care clinical applications. However, reconstructing high-fidelity 3D cardiac geometry from SAR remains an open challenge. Traditional radar methods generate sparse point clouds that lack continuous surface topology. Meanwhile, direct application of optical reconstruction networks performs poorly due to the severe speckle noise and ambiguous boundaries inherent in SAR images. To bridge this gap, we propose SAR2Mesh, a novel framework that reformulates the task as a coarse-to-fine mesh deformation process. By initializing with a topological template, our approach explicitly preserves anatomical connectivity through progressive mesh deformation. We introduce a geometry-aware feature projection module to extract multi-view features via 3D-to-2D sampling, and a physics-informed radar loss to enforce consistency between the predicted geometry and raw radar echoes. Furthermore, we present Cardiac Mesh-SAR, the first large-scale paired SAR-mesh dataset. Extensive experiments demonstrate that SAR2Mesh significantly outperforms existing image-based baselines, achieving accurate and physically consistent cardiac reconstructions.

2605.26001 2026-05-26 cs.CL cs.AI cs.CY

AI-Assisted Systematization for Evaluating GenAI Systems

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar, Hannah Washington, Alexandra Chouldechova, Solon Barocas, Hanna Wallach

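AI summary: To address underspecified concepts in generative AI evaluation, proposes AI-assisted systematization, which uses concept specs and a validation worksheet to produce measurable concept specifications, and evaluates their content validity and information recoverability.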

Abstract

Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts -- hate-based rhetoric and digital empathy -- and evaluate resulting concept specs on content validity and information recoverability.

2605.26000 2026-05-26 stat.ML cs.LG stat.ME

Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

Jose Blanchet, Peter Glynn, Wenhao Yang

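AI summary: For stochastic gradient descent whose gradient variance may be infinite, proposes a model-agnostic method for constructing confidence regions based on joint weak convergence and a self-normalized statistic, with subsampling calibration providing asymptotically valid inference.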

Abstract

Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.
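
The cancellation of scale terms in a self-normalized statistic can be seen in a generic Student-style ratio. This is not the paper's statistic (which pairs the Polyak-Ruppert average with a gradient-based normalizer); it is a small Python illustration, with made-up numbers, of why dividing by an empirical second-moment term removes an unknown scale.

```python
import math


def self_normalized(values):
    """Sum divided by the root of the sum of squares (0.0 if all zero)."""
    total = sum(values)
    scale = math.sqrt(sum(v * v for v in values))
    return total / scale if scale > 0 else 0.0


errors = [0.3, -1.2, 0.8, 5.0, -0.4]     # hypothetical centered errors
rescaled = [1000.0 * e for e in errors]  # same data on another scale

t_original = self_normalized(errors)
t_rescaled = self_normalized(rescaled)
```

The two values agree up to floating-point error, so the statistic can be calibrated without estimating the scale of the noise.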

2605.25998 2026-05-26 cs.LG

Causal methods for LLM development and evaluation

Dennis Frauen, Marie Brockschmidt, Konstantin Hess, Haorui Ma, Yuchen Ma, Abdurahman Maarouf, Maresa Schröder, Jonas Schweisthal, Yuxin Wang, Athiya Deviyani, Sonali Parbhoo, Rahul G. Krishnan, Stefan Feuerriegel

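AI summary: Argues that causal methods can address key causal questions in LLM development and evaluation, and systematically maps the opportunities for applying them in pretraining, alignment, routing, and other stages.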

Comments
Published in KDD 2026
Abstract

Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.

2605.25997 2026-05-26 cs.LG stat.ML

Deployment-complete benchmarking

El Mustapha Mansouri, Keigo Arai

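AI summary: Proposes a deployment-complete benchmarking framework that uses evidence fibers and completion curves to quantify whether benchmark evidence suffices to determine a deployment action, and shows that scores alone are not enough to support deployment decisions.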

Comments
33 pages, 5 figures, 1 table; supplementary tables and code available
Abstract

Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
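
The completeness condition above (the action is constant on each evidence fiber) can be checked by grouping candidates that share the same benchmark evidence. The sketch below assumes evidence and actions are hashable labels; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def fiber_report(candidates):
    """Group candidates by benchmark evidence and flag mixed fibers.

    candidates: iterable of (evidence, action) pairs.
    Returns (is_complete, fraction_of_mixed_fibers).
    """
    fibers = defaultdict(set)
    for evidence, action in candidates:
        fibers[evidence].add(action)
    if not fibers:
        return True, 0.0
    mixed = [e for e, actions in fibers.items() if len(actions) > 1]
    return len(mixed) == 0, len(mixed) / len(fibers)


# Hypothetical candidates: (benchmark score bucket, correct deployment action).
candidates = [
    ("high", "deploy"), ("high", "deploy"),
    ("mid", "deploy"), ("mid", "reject"),  # same evidence, different action
    ("low", "reject"),
]
complete, mixed_fraction = fiber_report(candidates)
```

In this toy case the "mid" fiber is mixed, so the benchmark alone cannot determine the action for those candidates and further evidence would be needed.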

2605.25168 2026-05-26 eess.IV cs.AI cs.CV

Methodology for Creating a Clinically Verified Dermoscopic Image Dataset

Kozachok Elena Sergeevna

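AI summary: Proposes a methodology that combines a standard operating procedure for mobile dermatoscopic image acquisition, a structured metadata information model, and multi-stage expert verification to build a clinically verified dermatoscopic image dataset for medical informatics research.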

Comments
22 pages, 5 figures, 5 tables
Abstract

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.
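
The counts reported above are internally consistent, which a few lines of Python confirm. The variable names are illustrative; the numbers are taken from the abstract.

```python
initial_records, duplicates = 1044, 18
unique_images = initial_records - duplicates

malignant = {
    "melanoma": 18,
    "basal cell carcinoma": 15,
    "squamous cell carcinoma": 6,
}
patients = {"female": 279, "male": 164}

total_malignant = sum(malignant.values())
total_patients = sum(patients.values())
# Percentages rounded to whole numbers, as in the abstract.
female_share = round(100 * patients["female"] / total_patients)
male_share = round(100 * patients["male"] / total_patients)
```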

2605.24728 2026-05-26 cs.AI

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Christopher Da Silva

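AI summary: Proposes the Hylos systems architecture, which uses contract bounds and spatial transaction management to ensure that generated or edited 3D content is operable, supporting downstream applications such as CAD and robotics.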

Comments
27 pages, 7 figures. Systems/position preprint with focused artifact study
Abstract

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

2605.23904 2026-05-26 cs.AI cs.CL

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo

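AI summary: Proposes SkillOpt, a systematic, controllable text-space optimizer in which a separate optimizer model makes bounded edits to a skill document and accepts an edit only when it strictly improves the validation score, which stabilizes skill training; it outperforms existing methods across six benchmarks.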

Comments
27 pages, 4 figures, 6 tables
Abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
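The acceptance rule described in this abstract can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the SkillOpt implementation: `propose_edit` and `validation_score` are hypothetical callables standing in for the optimizer model and the held-out evaluation, and the toy proposer and scorer are invented for the example.

```python
from typing import Callable, List, Tuple

def train_skill(
    skill: str,
    propose_edit: Callable[[str, List[str]], str],
    validation_score: Callable[[str], float],
    steps: int,
) -> Tuple[str, List[str]]:
    """Greedy accept-if-strictly-better loop over a single skill document."""
    best_score = validation_score(skill)
    rejected: List[str] = []  # rejected-edit buffer, passed back to the proposer
    for _ in range(steps):
        candidate = propose_edit(skill, rejected)
        score = validation_score(candidate)
        if score > best_score:  # strict improvement only; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, rejected

# Toy stand-ins (hypothetical): the score counts distinct bullet hints.
HINTS = ["check units", "show work", "check units"]

def toy_propose(skill: str, rejected: List[str]) -> str:
    hint = HINTS[toy_propose.calls % len(HINTS)]
    toy_propose.calls += 1
    return skill + "\n- " + hint

toy_propose.calls = 0

def toy_score(skill: str) -> float:
    bullets = {line for line in skill.splitlines() if line.startswith("- ")}
    return float(len(bullets))

final_skill, rejected_edits = train_skill("# Skill", toy_propose, toy_score, steps=3)
```

In the toy run the third proposal repeats an earlier hint, so it ties the current score and lands in the rejected buffer instead of being committed.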

2605.23082 2026-05-26 stat.ML cs.AI cs.LG

KAPLAN: Kolmogorov-Arnold Prognostic Learnable Activation Networks for Survival Analysis

KAPLAN: 用于生存分析的Kolmogorov-Arnold预后可学习激活网络

Stelios Boulitsakis Logothetis, Angela Wood, Pietro Liò

AI总结 提出KAPLAN-HR模型,利用B样条Kolmogorov-Arnold网络非参数估计条件风险函数,通过深层架构自动捕捉交互和时变效应,并证明其收敛速率仅依赖于表示平滑性,从而缓解维度灾难,在六个临床数据集上达到或超越现有方法。

Comments
9 pages, 3 figures, 13 supplementary pages. Submitted to NeurIPS 2026
AI中文摘要

生存分析旨在建模协变量和时间如何共同影响右删失下的事件时间分布。经典方法如Cox模型和广义加性模型(GAM)需要手动指定交互和时变效应,这在丰富的临床数据集上越来越不切实际。我们引入了KAPLAN-HR,一种B样条Kolmogorov-Arnold网络(KAN),用于非参数估计条件风险函数作为协变量和时间的联合函数。单层KAPLAN-HR模型恢复GAM,而更深层的架构通过组合捕捉交互和时变效应。我们为非参数KAN风险估计器建立了收敛速率,该速率仅依赖于底层KAN表示的平滑性,而不依赖于协变量维度,从而缓解了KAN可表示目标的维度灾难。在六个临床基准数据集的评估中,KAPLAN-HR匹配或超过了已建立的统计和深度学习生存方法的预测性能。

英文摘要

Survival analysis aims to model how covariates and time jointly shape the time-to-event distribution under right censoring. Classical methods such as the Cox model and generalised additive models (GAMs) require interactions and time-varying effects to be manually specified, which is increasingly impractical on rich clinical datasets. We introduce KAPLAN-HR, a B-spline Kolmogorov-Arnold Network (KAN) for nonparametric estimation of the conditional hazard as a joint function of covariates and time. A single-layer KAPLAN-HR model recovers a GAM, while deeper architectures capture interactions and time-varying effects through composition. We establish a convergence rate for the nonparametric KAN hazard estimator that depends only on the smoothness of the underlying KAN representation and not on the covariate dimension, thereby mitigating the curse of dimensionality for KAN-representable targets. In evaluations over six clinical benchmark datasets, KAPLAN-HR matches or exceeds the predictive performance of established statistical and deep learning survival methods.
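For orientation, these are the textbook relations that connect a conditional hazard to the survival function and to the right-censored log-likelihood. They are standard survival analysis, not taken from the paper; the abstract does not state which objective KAPLAN-HR is trained with. The symbols $t_i$ (observed time), $\delta_i$ (event indicator, 1 if the event was observed and 0 if censored), and $x_i$ (covariates) are notation assumed here.

```latex
\begin{align}
  S(t \mid x) &= \exp\!\left(-\int_0^{t} \lambda(s \mid x)\,\mathrm{d}s\right), \\
  \ell(\lambda) &= \sum_{i=1}^{n} \left[\, \delta_i \log \lambda(t_i \mid x_i)
      - \int_0^{t_i} \lambda(s \mid x_i)\,\mathrm{d}s \,\right].
\end{align}
```

Censored observations ($\delta_i = 0$) contribute only the integral term, which is why a model of the hazard as a joint function of covariates and time can be fitted without discarding them.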

2605.18646 2026-05-26 cs.CL

Language-Switching Triggers Take a Latent Detour Through Language Models

语言切换触发器通过语言模型的潜在迂回路径

Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah

AI总结 本文通过电路分析揭示了一个8B参数自回归语言模型中语言切换后门攻击的内部机制,该攻击由三个拉丁词触发,将英语输出重定向为法语,并发现触发信号通过正交于语言身份方向的潜在空间传播,最终由最后一层MLP转换为法语logits。

Comments
15 pages, 16 figures. Under review
AI中文摘要

语言模型的后门攻击日益成为安全关注点,但触发器序列劫持模型计算的内部机制仍不明确。我们识别了一个8B参数自回归语言模型中语言切换后门背后的电路,其中三个拉丁词触发器(九个token)将英语输出重定向为法语。我们将该电路分解为三个阶段:(1)早期层的分布式注意力头将触发token组合到最后一个序列位置;(2)产生的信号通过中间层在正交于模型自然语言身份方向的子空间中传播;(3)最后一层的MLP将此潜在信号转换为法语logits。整个电路流经单个位置的串行瓶颈:在任何层破坏该位置完全消除触发器,但也损害模型能力。正交潜在编码表明,在中间表示中搜索类似语言信号的防御将完全错过此触发器。

英文摘要

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
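The closing claim, that a signal orthogonal to the language-identity direction escapes defenses looking along that direction, can be illustrated with plain linear algebra. This is a toy sketch with random vectors, assuming a single hypothetical `lang_dir` direction and a synthetic `trigger` vector; it does not reproduce the paper's model or circuit.

```python
import numpy as np

def remove_direction(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project activations h of shape (n, dim) onto the complement of direction d."""
    u = d / np.linalg.norm(d)
    return h - np.outer(h @ u, u)

rng = np.random.default_rng(0)
dim = 16
lang_dir = rng.normal(size=dim)  # hypothetical language-identity direction
lang_unit = lang_dir / np.linalg.norm(lang_dir)

# Hypothetical trigger signal, constructed to be orthogonal to lang_dir.
raw = rng.normal(size=dim)
trigger = raw - (raw @ lang_unit) * lang_unit

h = rng.normal(size=(4, dim)) + trigger   # activations carrying the trigger
probe_score = float(trigger @ lang_unit)  # what a language-direction probe reads
h_clean = remove_direction(h, lang_dir)   # a defense that ablates the direction
```

The probe reads essentially zero on the trigger, and ablating the language direction leaves the trigger component of `h` unchanged, which is the failure mode the abstract warns about.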

2605.02836 2026-05-26 cs.LG math.AT

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

一种用于认证点云和图分类的闭式持久性-地标管道

Sushovan Majhi, Atish Mitra, Žiga Virk, Pramita Bagchi

AI总结 提出PLACE管道,通过闭式公式从持久同调签名中分类点云和图,无需学习权重或校准,提供基于间隔的过量风险率、描述符选择规则和每个预测的认证。

Comments
TMLR submission, https://openreview.net/forum?id=4kZxNlE5Ve. v2: variance-aware Pinelis-Bernstein certificate (radius iii) fires on 8/12 benchmarks (v1: not operational); MUTAG: empirical and population NC rules agree on 940/940 predictions. Matching-free nu-coherence replaces non-interference. Le Cam lower bound (Thm 3.2) recast PD-native, matching regime m<~R/D explicit
AI中文摘要

我们引入PLACE(持久性-地标分析分类引擎),一种通过持久同调签名对点云和图进行分类的闭式管道。三个定量保证——基于间隔的过量风险率、闭式描述符选择规则和每个预测的认证——仅从训练标签中推导,无需学习权重或保留校准。嵌入将Mitra-Virk单点坐标函数求和到稀疏地标网格上;闭式权重规则$w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2$在$\nu$-相干性下最大化Mitra-Virk仿射证书中的失真斜率。(i) 由类均值分离$\Delta$和嵌入半径$R$驱动的$O(kR/(\Delta\sqrt{m_{\min}}))$间隔界,在样本匮乏区域$m \lesssim R/\Delta$中由Le Cam极小极大下界匹配。(ii) 在Ledoit-Wolf收缩协方差下的马氏距离是64描述符化学图池中最强的闭式排序器(11个基准上平均Spearman $\rho=+0.56$,11个中10个为正);各向同性替代$\Delta/\sqrt{\ell}$在同质蛋白质/社交池上具有闭式选择一致性率。(iii) 训练时决定的证书,无每个预测开销,有三种具体半径(Pinelis、高斯插件和方差感知的Pinelis-Bernstein)。实验上,PLACE是Orbit5k上最强的基于图的方法,并在MUTAG和COX2上在统计噪声内匹配最强的基于拓扑的基线;剩余差距分为两个可诊断区域(NCI1/NCI109上的描述符盲点;其他地方的池覆盖限制)。Pinelis-Bernstein半径在12个基准中的8个上触发;在MUTAG上,经验和总体最近质心规则在940个保留测试预测中的每一个上一致,验证了证书的机制。

英文摘要

We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees -- a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate -- are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; the closed-form weight rule $w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2$ maximizes the distortion slope in Mitra-Virk's affine certificate under $ν$-coherence. (i) An $O(kR/(Δ\sqrt{m_{\min}}))$ margin bound, driven by class-mean separation $Δ$ and embedding radius $R$, matched in the sample-starved regime $m \lesssim R/Δ$ by a Le Cam minimax lower bound. (ii) The Mahalanobis margin under Ledoit-Wolf-shrunk covariance is the strongest closed-form ranker on a 64-descriptor chemical-graph pool (mean Spearman $ρ= +0.56$ across 11 benchmarks, positive on 10 of 11); the isotropic surrogate $Δ/\sqrt{\ell}$ admits a closed-form selection-consistency rate on the homogeneous protein/social pools. (iii) A training-time-decided certificate, with no per-prediction overhead, in three concrete radii (Pinelis, Gaussian plug-in, and variance-aware Pinelis-Bernstein). Empirically, PLACE is the strongest diagram-based method on Orbit5k and matches the strongest topology-based baseline within statistical noise on MUTAG and COX2; remaining gaps fall into two diagnosable regimes (descriptor blindness on NCI1/NCI109; pool-coverage limits elsewhere). The Pinelis-Bernstein radius fires on 8 of the 12 benchmarks; on MUTAG the empirical and population nearest-centroid rules agree on every one of 940 held-out test predictions, validating the certificate's mechanism.
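The closed-form weight rule $w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2$ is easy to evaluate once $d_k$ and $R_k$ are known. The abstract does not define these quantities, so the sketch below treats them as given arrays and picks the proportionality constant so that the squared weights sum to one; both choices are assumptions made for illustration.

```python
import numpy as np

def landmark_weights(d, R) -> np.ndarray:
    """Weights with w_k^2 proportional to (d_{k+1}^2 - d_k^2) / R_k^2.

    Normalised so that sum(w_k^2) == 1 (an assumed convention).
    """
    d = np.asarray(d, dtype=float)
    R = np.asarray(R, dtype=float)
    if d.size != R.size + 1:
        raise ValueError("need exactly one more d value than R values")
    w_sq = (d[1:] ** 2 - d[:-1] ** 2) / R ** 2
    if np.any(w_sq < 0):
        raise ValueError("d must be non-decreasing")
    return np.sqrt(w_sq / w_sq.sum())

# Illustrative inputs only; these numbers do not come from the paper.
w = landmark_weights(d=[0.0, 1.0, 2.0, 3.0], R=[1.0, 1.0, 2.0])
```

With these inputs the unnormalised squared weights are 1, 3, and 1.25, so the middle landmark receives the largest weight.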

2605.25991 2026-05-26 cs.LG cs.NA math.NA

Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Fuzzy PyTorch: 深度学习模型的快速数值变异性评估

Inés Gonzalez-Pepe, Hiba Akhaddar, Tristan Glatard, Yohan Chatelain

AI总结 提出Fuzzy PyTorch框架,通过集成随机算术和概率舍入实现深度学习模型数值变异性的快速评估,相比现有工具Verrou实现5至60倍加速,并支持从1到3.41亿参数的模型规模。

Comments
19 pages, 8 figures, Published in Transactions on Machine Learning Research (01/2026)
AI中文摘要

我们介绍了Fuzzy PyTorch,一个用于快速评估深度学习(DL)模型中数值变异性的框架。随着DL越来越多地应用于各种任务,理解浮点运算带来的变异性对于确保稳健可靠的性能至关重要。评估此类变异性的工具必须具有可扩展性、高效性,并能与现有框架无缝集成,同时最小化代码修改。Fuzzy PyTorch通过将随机算术集成到PyTorch中实现了这一点,它采用了一种名为“概率舍入与指令集管理”的新型库,该库与数值分析编译器Verificarlo接口。该库提供了随机舍入模式以及一种新模式:上下舍入。对比评估显示,Fuzzy PyTorch保持了模型性能,并且与最先进的工具Verrou相比,运行时间减少了5倍到60倍。我们进一步通过运行从1到3.41亿参数的模型展示了其可扩展性,确认了其在小型和大型DL架构中的适用性。总体而言,Fuzzy PyTorch为评估深度学习中的数值变异性提供了一种高效、可扩展且实用的解决方案,使研究人员和从业者能够在不牺牲性能或计算效率的情况下量化和管理浮点不确定性。

英文摘要

We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers a stochastic rounding mode and a novel mode: up-down rounding. Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of 5x to 60x versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
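As background on the stochastic rounding mode mentioned above, here is a generic sketch of stochastic rounding on a uniform grid. It is not the library's implementation, which the abstract describes as working through the Verificarlo compiler, and it does not attempt up-down rounding, which the abstract names but does not define. The function name and the `step` parameter are assumptions for the example.

```python
import numpy as np

def stochastic_round(x, step: float, rng: np.random.Generator) -> np.ndarray:
    """Round x to a multiple of `step`.

    Rounds up with probability equal to the fractional position of x
    between its two neighbouring grid points, so the result is unbiased.
    """
    x = np.asarray(x, dtype=np.float64)
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower
    round_up = rng.random(x.shape) < frac
    return (lower + round_up) * step

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
```

Every sample is either 0 or 1, yet the sample mean stays close to 0.3. Repeating a computation under such rounding and looking at the spread of results is the basic idea behind stochastic-arithmetic variability analysis.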

2605.25988 2026-05-26 cs.CL

What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

什么使医学检查器可训练?诊断生物医学问答中检查器引导的RAG中的信号崩溃和奖励黑客

Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wan

AI总结 本文通过比较四种NLI检查器作为GRPO训练的医学RAG代理的过程奖励,诊断了信号崩溃和奖励黑客现象,发现检查器的输出分布而非保留准确率决定了其是否提供可训练梯度,并提出了验证器作为奖励系统的边界条件。

AI中文摘要

医学RAG需要基于证据的声明,因此将声明级别的NLI检查器插入检索增强的强化学习是直观的。我们发现,检查器在训练期间的输出分布,而不是其保留准确率,决定了它是否提供可训练梯度。我们比较了四种NLI检查器后端作为GRPO训练的医学RAG代理(Qwen2.5-7B,在Qwen3-4B和Llama-3.1-8B上复制)的过程奖励,跨越四个保留的医学QA基准。出现了三个诊断发现。(i) 信号崩溃是log概率特定的:LLM log概率评分将超过97%的声明标记为中性(使RL梯度降至零),而校准的MedNLI分类器对相同对进行非退化评分。(ii) 在答案质量上,中等信号优于强信号:一个强大的专有检查器触发三步奖励黑客级联(超短答案、搜索回避、语言崩溃),因此中等信号的本地分类器训练出更高质量的模型(+12% BERTScore相对于零样本,无GPT依赖)。(iii) 信号强度是策略依赖的:相同的检查器在一个策略上表现为中等,但在另一个策略上表现为强,而不触发级联终点。我们将这些视为验证器作为奖励系统的边界条件。

英文摘要

Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker's output distribution during training, not its held-out accuracy, decides whether it provides trainable gradient. We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. (i) Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97% of claims neutral -- collapsing the RL gradient to zero -- while a calibrated MedNLI classifier scores the same pairs non-degenerately. (ii) Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade -- ultra-short answers, search avoidance, language collapse -- so a moderate-signal local classifier trains a higher-quality model (+12% BERTScore over zero-shot, no GPT dependency). (iii) Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.
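Why a near-constant checker output removes the training signal can be seen from the group-relative advantage that GRPO-style methods commonly use: rewards are centred and scaled within each group of sampled responses. The sketch below assumes that standard normalisation; the paper's exact reward shaping is not given in the abstract, and the reward values are invented.

```python
import numpy as np

def group_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Centre and scale rewards within one group of sampled responses."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Checker labels every claim "neutral": all responses get the same reward.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])

# Checker separates responses: rewards differ within the group.
informative = group_advantages([0.0, 1.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are exactly zero and the policy update vanishes, whatever the checker's held-out accuracy is.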

2605.25984 2026-05-26 cs.CL cs.AI

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

SafeCtrl-RL: 通过RL驱动的提示优化的LLM对话推理时自适应行为控制

Michael Orme, Yanchao Yu, Zhiyuan Tan

AI总结 提出SafeCtrl-RL框架,利用强化学习在推理时动态选择提示调整策略,无需重新训练即可抑制不安全行为,提升LLM对话的安全性和响应质量。

AI中文摘要

确保大型语言模型(LLM)的安全和上下文适当行为仍然是实际部署的关键挑战。我们提出了SafeCtrl-RL,一个推理时行为控制框架,无需模型重新训练或参数修改即可实现自适应安全调节。该方法将对话生成形式化为一个序列决策过程,其中强化学习代理根据上下文反馈动态选择提示调整策略。这使得不安全行为可以通过迭代细化被抑制,我们将其概念化为推理时行为遗忘。在多个LLM和不安全对话场景下的评估表明,SafeCtrl-RL一致地提高了安全性和响应质量,优于现有的基于提示的优化方法,并实现了良好的性能-效率权衡。警告:本文可能包含有害语言的示例,建议读者谨慎阅读。

英文摘要

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.

2605.25979 2026-05-26 cs.CV

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

LLaVA-OneVision-2:迈向下一代感知智能

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

AI总结 提出LLaVA-OV-2模型,通过编解码流令牌化、窗口注意力和3D RoPE实现统一视频理解与时空定位,在多项基准上超越Qwen3-VL-8B。

AI中文摘要

我们介绍LLaVA-OneVision-2(LLaVA-OV-2),这是LLaVA-OneVision系列中迄今为止能力最强的视觉语言模型,在广泛的多模态基准测试中均取得了卓越性能。该模型基于原生OneVision编码器,并引入窗口注意力机制以实现高效的局部计算,同时保持原生分辨率。其关键进展是编解码流令牌化:它将压缩视频视为连续的比特成本流,其中比特成本动态决定自适应时间分组,运动残差线索选择显著空间证据到紧凑的视觉画布中。这种分配将有限的令牌预算集中在包含事件的内容上,相比固定图片组,实现了更稳定的长视频令牌压缩。共享的3D RoPE进一步将编解码画布、采样帧和图像置于统一的时空坐标系中。此外,我们围绕大规模开放监督构建了LLaVA-OV-2数据和训练栈:约800万重新标注的视频样本用于预训练,400万样本的空间语料库用于微调。我们还引入了JumpScore,这是一个针对高频、密集重复运动中的细粒度定位的时空定位基准,填补了现有视频评估的空白。LLaVA-OV-2的一项突出能力是其在视频理解、时空定位、空间定位和操作轨迹推理上的统一感知。在JumpScore上,LLaVA-OneVision-2-8B达到74.9 JumpScore mAP,比Qwen3-VL-8B(30.1)高出44.8分;在同一基准的匹配视觉令牌预算下,编解码流输入相比帧采样在时空定位上提升9.7分。在标准基准上,LLaVA-OneVision-2-8B在视频任务上平均比Qwen3-VL-8B高出4.3分,在空间任务上高出5.3分,在跟踪任务上平均J&F高出15.6分。

英文摘要

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.

2605.25977 2026-05-26 cs.CL cs.AI cs.LG

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

创意质量对齐:通过思维链微调实现专家隐性知识迁移

Bo Zou, Chao Xu

AI总结 本文通过低数据成本和小基模型的严格工程条件,实证验证了校准惊喜中的创意质量度量,并发现数据偏差,提出创意质量对齐方法及理论解释。

AI中文摘要

本文对校准惊喜(Zou & Xu, 2026a)中提出的创意质量度量进行了实证实现。本文解决的问题是:这一数学主张在工程层面是否成立?为使答案尽可能通用,我们特意选择了最严格的工程条件:低数据成本和小基模型。训练数据来自BC协议(Zou & Xu, 2026b)产生的大约100个专家思维链(CoT)标注。我们还发现了一个数据偏差:大多数公开可用的对齐数据集偏向于工艺相关知识,而受众建模和现实逻辑覆盖系统性薄弱。我们使用术语“创意质量对齐”(CQA)来描述这类工程方法。我们还提供了一个支持性的理论观察:在具有单一条件分布架构的LLM中,通过架构对偶性,校准欣赏侧会自动迁移到生成侧。这是大约100个CoT示例就足够的结构性原因——而非像LIMA(Zhou et al., 2023)那样的纯粹经验观察。

英文摘要

This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. Training data comes from approximately 100 expert chain-of-thought (CoT) annotations produced by the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while audience modeling and reality-logic coverage are systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This is the structural reason why ~100 CoT examples are sufficient -- not a purely empirical observation like LIMA (Zhou et al., 2023).

2605.25969 2026-05-26 cs.CL

Triplet-Block Diffusion RWKV

三元组块扩散RWKV

Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao

AI总结 提出B^3D-RWKV,通过三元组块布局方法将RWKV的线性推理效率与双向离散扩散结合,实现并行解码,在8任务套件上达到可比精度,解码吞吐量平均提升1.6倍。

AI中文摘要

因果Transformer语言模型存在严格顺序解码和每步二次注意力成本的问题。虽然线性时间因果模型和离散扩散模型各自解决了这些弱点,但它们的整合本质上不一致:扩散需要双向注意力,而因果模型是单向的。为了统一这些架构,我们提出了$B^3D-RWKV$,一种扩散RWKV变体,通过三元组块布局方法将模型的$O(L)$推理效率与并行、双向离散扩散相结合。$B^3D-RWKV-7.2B$在8任务套件上达到了与现有模型相当的准确率,同时在解码吞吐量上显著优于基线,平均加速$\mathbf{1.6\times}$。

英文摘要

Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose $B^3D-RWKV$, a diffusion RWKV variant that integrates the model's $O(L)$ inference efficiency with parallel, bidirectional discrete diffusion through a triplet-block layout method. $B^3D-RWKV-7.2B$ reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput with an average of $\mathbf{1.6\times}$ speedup.

2605.25968 2026-05-26 cs.CV

Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

基于上下文驱动的缺失模态学习用于图像-表格数据的鲁棒医学诊断

Tianling Liu, Lequan Yu, Tong Han, Liang Wan

AI总结 提出CMML框架,通过级联残差变换器自编码器合成缺失模态并利用上下文令牌进行语义对齐,在三种医学数据集上超越现有方法。

Comments
12 pages, 8 figures
AI中文摘要

虽然整合多种成像和临床表格记录的多模态数据对于准确医学诊断至关重要,但临床实践中特定模态的任意缺失普遍存在,严重降低了多模态模型的性能。现有方法要么丢弃缺失模态导致信息丢失,要么在未捕获复杂模态间依赖关系的情况下难以合成它们。为解决这些限制,我们提出了一种新颖的上下文驱动缺失模态学习(CMML)框架,该框架顺序执行模态合成和语义对齐,以在任意缺失条件下实现鲁棒诊断。具体来说,我们设计了一个基于级联残差变换器的自编码器(CRTA),利用可学习的上下文令牌作为数据集级语义先验来捕获模态间依赖关系并合成关键的缺失表示。这些表示进一步通过模态特定的记忆库得到丰富。为解决原始可用表示与合成表示之间的差异,我们通过注入来自CRTA输出的多模态表示,将学习到的上下文令牌转化为实例自适应的语义参考。该参考引导异构模态表示对齐到统一空间,最后应用类别感知对比细化来探索判别性诊断线索。在皮肤病变(Derm7pt)、眼病(ODIR)和脑膜瘤(MEN)数据集上的广泛评估表明,CMML显著优于最先进(SOTA)方法,平均AUC分别提升1.26%、0.97%和1.32%。

英文摘要

While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as a dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between the original available representations and the synthesized ones, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. These references guide the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding average AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

2605.25967 2026-05-26 cs.LG cs.SD

Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

隐藏在明文令牌中:简单、鲁棒、无梯度的合成音频水印

Georgios Milis, Yubin Qin, Yihan Wu, Heng Huang

AI总结 本文利用离散化中的词汇冗余,提出一种无需微调或梯度的合成音频水印方法,通过社区检测缩减词汇表提升检测鲁棒性,在音频修改下仍保持高性能。

Comments
Accepted to ICML 2026
AI中文摘要

随着政策追赶生成式AI的能力,水印技术成为内容溯源工作的核心。自回归模型的推理时水印由于离散化不一致而不适用于连续模态。现有方法通过微调模态分词器来克服这一问题,但失去了水印无需训练的优势。在这项工作中,受离散化中词汇冗余的启发,我们提出了一种优雅的解决方案,用于合成音频的强大且鲁棒的水印。我们从理论上分析了令牌错误对水印检测的影响,并通过社区检测获得的缩减词汇表有效缓解了这些问题。充分的实验表明,我们的无梯度方法可以将可检测性提高几个数量级,同时实现对音频修改的内置鲁棒性。广泛地说,我们发现了多媒体中令牌级水印的新最先进技术,这仅仅源于离散表示学习的本质。

英文摘要

As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.

2605.25966 2026-05-26 cs.LG cs.CL stat.ML

Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

在小于100M参数量化感知训练中映射调度策略与位宽边界

Christian Brandt Thomassen

AI总结 通过大规模实验研究子100M参数解码器语言模型中,量化感知训练的最佳学习率调度是否依赖于位宽,发现INT6 QAT无需不同调度,INT4在50M以上需wd33调度,以下则噪声主导。

Comments
20 pages, 6 figures, 4 tables. 1345 training runs total (720 + 625). Submitted for review at TMLR
AI中文摘要

我们测试了在子100M参数解码器语言模型中,从初始化开始的量化感知训练(QAT)的最佳学习率调度是否依赖于位宽。一项720次运行的因子网格实验(阶段2)覆盖了位宽×衰减分数×学习率大小×模型大小×随机种子(FP16/INT8/INT6,15M-100M,5个种子),发现在每个(位宽,大小)单元中,最佳衰减分数为33%。主要假设——INT6 QAT需要与高精度训练不同的调度——在FP16/INT8/INT6下被证伪。后续625次运行(阶段5)沿五个轴探测零假设:优化器(AdamW)、调度形状(余弦)、训练长度(最多9倍迭代次数)、扩展的大小扫描(5M-350M)以及从3M到100M的INT4扫描。零假设在所有三种设置变化下均稳健。INT6的惩罚遵循对数线性缩放定律,其在阶段2的拟合预测了五个保留的阶段5大小(5M、8M、175M、250M、350M),且均在95%预测区间内(5/5)。对于INT4,情况比高精度更清晰:在50M和100M时,wd33明确最优(配对z~12-15,10/10种子);低于50M时,在从3M到30M的六个测试大小中,没有单个大小显示出统计显著的调度偏好,且每个大小的平均惩罚在种子级噪声内振荡。因此,边界是从低于50M的噪声主导区域到50M及以上明确的wd33区域的过渡,而非清晰的wd10区域。权重到网格距离的探测证伪了FP16/INT8/INT6零假设的最简单机制(快速网格锁定):在衰减前,INT6-QAT权重与INT6网格的距离基本与FP16权重相同(比率~1.04)。实用建议:在子100M规模下,在FP16上调优一次学习率调度,并原封不动地应用于INT8/INT6 QAT;对于50M以上的INT4,使用wd33;对于50M以下的INT4,调度选择在噪声中。

英文摘要

We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.
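To make the "warmdown fraction" concrete, here is one plausible reading of a schedule such as wd33: hold the learning rate constant, then decay it linearly to zero over the final 33% of training. The linear shape, the absence of warmup, and the function name `lr_at` are assumptions for illustration; the abstract gives only the fraction.

```python
def lr_at(step: int, total_steps: int, peak_lr: float, warmdown_frac: float) -> float:
    """Constant LR, then linear decay to zero over the last `warmdown_frac` of training."""
    if not 0.0 <= warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must lie in [0, 1]")
    warmdown_steps = int(round(total_steps * warmdown_frac))
    start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / warmdown_steps

# "wd33" under the assumptions above: 100 steps, the last 33 of them decaying.
schedule = [lr_at(s, total_steps=100, peak_lr=1.0, warmdown_frac=0.33) for s in range(100)]
```

Under the abstract's recommendation, the same schedule tuned at FP16 would be reused unchanged for INT8 and INT6 QAT runs.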

2605.25964 2026-05-26 cs.AI

LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation

LECTOR: 科学推理图与引言生成的联合优化

Jiabei Xiao, Yizhou Wang, Chen Tang, Pengze Li, Wanli Ouyang, Shixiang Tang

AI总结 提出LECTOR框架,通过逻辑-表达协同强化学习联合优化科学推理图的结构保真度和引言生成质量,在Nature Communications数据集上实现显著提升。

Comments
25 pages
AI中文摘要

AI科学家在研究流程的多个阶段已展现出有希望的进展,其中自动科学论文写作仍然是一个艰巨的挑战。引言写作尤其困难,不仅要求语言流畅,还需要逻辑合理性和可验证的忠实性。大多数AI辅助方法将任务视为文本生成而非推理和结构化,导致严重缺陷,例如引用幻觉。为解决此问题,我们首先定义了内容条件引言生成(CCIG)任务,要求引言基于论文的核心证据。然后我们提出LECTOR,一种新颖的逻辑-表达协同强化学习框架,能够严格遵循科学家的逻辑,添加高质量引用并保持结构化表达。LECTOR首先从论文主体构建逻辑推理图,作为可验证的逻辑蓝图。随后,它采用逻辑-表达协同奖励机制,联合优化图的结构保真度和最终叙述的质量。我们从Nature Communications论文中构建数据集来评估我们的方法。大量实验表明,在逻辑保真度和引言生成质量指标上均有一致改进,例如图质量(+26.7%)、引用质量(+8.6%)和论文一致性(+3.3%)。代码和数据可在https://github.com/Xiao-Youth/LECTOR获取。

英文摘要

AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing remains a formidable challenge. Writing the Introduction is especially challenging, as it demands not only linguistic fluency but also logical soundness and verifiable faithfulness. Most AI-assisted methods treat the task as text generation instead of reasoning and structuring, leading to severe drawbacks, e.g., hallucinating citations. To address this, we first formulate the Content-Conditional Introduction Generation (CCIG) task, which requires grounding the Introduction in the paper's core evidence. We then propose LECTOR, a novel Logic-Expression Co-Reinforcement Learning framework that can strictly follow the scientist's logic, add high-quality citations, and keep structured expressions. LECTOR first constructs a logic-reasoning graph from the paper's main body to serve as a verifiable logical blueprint. Subsequently, it employs a Logic-Expression Co-Rewarding mechanism to jointly optimize for both the graph's structural fidelity and the final narrative's quality. We construct a dataset from Nature Communications papers to assess our method. Extensive experiments show consistent improvements in both logic fidelity and Introduction generation quality metrics, e.g., Graph Quality (+26.7%), Citation Quality (+8.6%), and Paper Consistency (+3.3%). Code and data are available at https://github.com/Xiao-Youth/LECTOR.

2605.25962 2026-05-26 cs.SD cs.AI

Continual Speaker Identity Unlearning with Minimal Interference

持续说话人身份遗忘与最小干扰

Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko

AI总结 提出CORTIS框架,通过Fisher信息参数掩码和正交投影实现零样本语音合成中持续说话人身份遗忘,避免先前遗忘的说话人重新出现。

详情
Comments
preprint
AI中文摘要

机器遗忘从预训练模型中移除指定概念或知识。最近的工作将此范式扩展到零样本语音合成(ZS-TTS)中的说话人身份遗忘,即选择性擦除模型复制说话人声音的能力。然而,现有方法默认所有遗忘请求同时到达,这是一个不现实的假设,因为隐私驱动的移除会随时间顺序到达。我们证明这一假设破坏了现有最先进的方法:遗忘每个新说话人会完全恢复先前遗忘的说话人,重新引入遗忘本应消除的隐私风险。我们提出了累积正交身份抑制(CORTIS),这是首个在ZS-TTS中实现持续说话人身份遗忘的框架,无需访问先前遗忘的说话人数据。CORTIS结合了基于Fisher信息的参数掩码(将更新定位到与说话人相关的权重)和针对先前遗忘更新子空间的正交投影。使用VoiceBox,CORTIS在长请求序列中遗忘每个请求的说话人,同时保持先前遗忘的说话人被遗忘,显著优于先前方法的顺序应用。演示地址:https://cumulativeortis.github.io/ 。

英文摘要

Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, quietly assume all unlearning requests arrive at once, an unrealistic assumption since privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/ .
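The two ingredients named in this abstract, Fisher-based masking and projection against the subspace of earlier updates, can be sketched on a flattened parameter vector. This is a toy illustration under assumptions, not the CORTIS code: the function names, the top-fraction masking rule, the mask-then-project order, and the random data are all hypothetical.

```python
import numpy as np

def fisher_mask(fisher: np.ndarray, keep_frac: float) -> np.ndarray:
    """Boolean mask selecting the top `keep_frac` of parameters by Fisher score."""
    k = max(1, int(round(keep_frac * fisher.size)))
    threshold = np.sort(fisher)[-k]
    return fisher >= threshold

def project_out(update: np.ndarray, prior_updates: list) -> np.ndarray:
    """Remove the component of `update` that lies in span(prior_updates)."""
    if not prior_updates:
        return update
    # Reduced QR gives an orthonormal basis for the span of the prior updates.
    basis, _ = np.linalg.qr(np.stack(prior_updates, axis=1))
    return update - basis @ (basis.T @ update)

rng = np.random.default_rng(0)
dim = 32
prior = [rng.normal(size=dim), rng.normal(size=dim)]  # earlier unlearning updates
raw_update = rng.normal(size=dim)                     # update for the new speaker
mask = fisher_mask(rng.random(dim), keep_frac=0.25)   # stand-in Fisher scores
new_update = project_out(raw_update * mask, prior)
```

Because `new_update` has no component along any earlier update, applying it cannot undo those earlier steps to first order, which is the intuition behind keeping previously unlearned speakers forgotten. Note that projecting after masking means the result is no longer exactly confined to the masked coordinates.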

2605.25958 2026-05-26 cs.CL cs.CE

PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction

Daren Wang, Hong Xu, Jiawen Xian

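AI summary: Proposes PolyGnosis 2.0, a multi-agent architecture that extracts predictive intelligence by fusing Polymarket anomaly signals with global open-source intelligence (GDELT), quantifies "Perspective Mismatches" as high-alpha trading signals, and evaluates the effectiveness of "Harness Engineering" techniques (reflection loops, tool-calling, divide-and-conquer, and chain-of-thought) in high-noise financial domains.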

Details
English abstract

This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically the Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic claims of agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.