arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07645 2026-06-09 cs.CV cs.AI New submission

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

Affiliations: Shenzhen Polytechnic University; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area; Shenzhen University

AI summary: Proposes FineGen, a framework that uses a Generation-Verification-Correction pipeline with closed-loop feedback to automatically build fine-grained datasets containing hard negatives; applied to ImageNet, it yields FineGen-100K and improves hard-sample accuracy by 14.4%.

Comments
15 pages, 2 figures, conference

Abstract

The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.

2606.07642 2026-06-09 cs.CV cs.CY New submission

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

Affiliations: University of Florida; University of South Florida; University of Illinois Urbana-Champaign

AI summary: Proposes an expert-guided retrieval-augmented framework that uses vision-language models to identify wheelchair accessibility barriers in Google Street View imagery; validation against GPS-derived wheelchair dwell behavior shows that VLM ratings partially align with mobility friction, while fine-grained barrier recognition remains limited.

Abstract

Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.

2606.07641 2026-06-09 cs.CV New submission

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced

AI summary: Studies whether vision-language models can predict what an image shows after a 180° rotation from the original image alone; introduces the RotOutBench benchmark and finds that models can recognize rotated content but fail to predict it.

Abstract

Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.

2606.07640 2026-06-09 cs.CV cs.AI cs.LG New submission

No Free Lunch for Synthetic Images under Data Scarcity Conditions

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

Affiliations: Universidad Politécnica de Madrid; Universidad de Alcalá

AI summary: Studies the trade-offs among fidelity, privacy, and utility of synthetic data under data-scarce, privacy-sensitive conditions; proposes a joint evaluation framework and compares VAE, GAN, and DDPM on three image datasets, finding GAN and DDPM more robust under differential privacy.

Abstract

This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcity and privacy sensitivity. We propose an evaluation framework that jointly assesses these three dimensions and apply it to three widely used generative models, VAE, GAN, and DDPM. The evaluation spans three image datasets, MNIST, OCTMNIST, and OrganAMNIST, encompassing both general-purpose and medical imaging domains. Notable differences arise between the three models in their behaviour when differential privacy mechanisms are introduced during training. GAN and DDPM demonstrate greater robustness, maintaining higher fidelity and downstream utility across a range of noise levels, while VAE degrades more rapidly as privacy constraints increase. This study highlights the importance of a multidimensional evaluation of deep generative models, also noting that their behaviour significantly differs when privacy techniques are applied.

2606.07639 2026-06-09 cs.CV cs.AI New submission

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes MOSS-Video-Preview, a two-channel cross-attention architecture that achieves real-time video understanding through non-blocking perception and generation; on a single H200 it delivers about a 5x speedup in time to first token and 2.7x higher decoding throughput.

Abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07638 2026-06-09 cs.CV cs.AI New submission

Anchor-Conditioned Compositional Control for Landscape Image Generation

Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

Affiliations: Rutgers University–New Brunswick; University of Maryland–College Park; University of Technology Sydney

AI summary: Proposes an anchor-conditioned finetuning framework that injects a four-dimensional compositional anchor vector through a decoupled cross-attention mechanism, enabling compositional control in landscape image generation; it achieves the best horizon detection rate and rule-of-thirds alignment among the compared variants.

Comments
Accepted to the International Conference on Computational Creativity, ICCC 2026

Abstract

Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor-conditioned finetuning framework for landscape image generation, in which a four-dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross-attention mechanism with Fourier encoding and three-way classifier-free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate (0.850) and the highest rule-of-thirds alignment (0.817). A category-specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40% compared to mixed training. This establishes that compositional control precision is category-dependent.
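
The Fourier encoding step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the anchor components or the frequency bands, so the power-of-two bands, the `num_bands` parameter, and the function name `fourier_encode` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def fourier_encode(anchor: Sequence[float], num_bands: int = 4) -> List[float]:
    """Map each coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k < num_bands.

    Hypothetical helper: the band layout is an assumption, not taken from the paper.
    """
    features: List[float] = []
    for x in anchor:
        for k in range(num_bands):
            angle = (2.0 ** k) * math.pi * x
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# A four-dimensional anchor with 4 bands yields 4 * 4 * 2 = 32 features.
encoded = fourier_encode([0.1, 0.2, 0.3, 0.4], num_bands=4)
```

Encoding low-dimensional scalars this way is a common choice when a conditioning signal must be distinguishable by attention layers; whether the paper uses the same band schedule is not stated in the abstract.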

2606.07635 2026-06-09 cs.CV cs.AI New submission

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

Xiongri Shen, Zhenxi Song, Jiaqi wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

Affiliations: Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University; Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences; Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital; School of Basic Medical Sciences, Shenzhen University; Egypt-Japan University of Science and Technology (E-JUST); School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School

AI summary: Proposes NeuroAlign, a framework that fuses fMRI and DTI features through dual-modal hierarchical alignment and dual-domain hierarchical interaction for MCI/SCD detection, and designs SAM, a gradient-free attribution method, for feature analysis.

Abstract

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07631 2026-06-09 cs.LG cs.AI cs.CY New submission

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

Affiliations: University of Maryland

AI summary: Proposes monitoring emergent misalignment during supervised finetuning using trait directions in activation space; a low-dimensional geometric signature enables efficient detection, reaching 0.990 AUROC on 7-9B models.

Comments
First version. 45 pages

Abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.
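
The idea of tracking drift along trait directions can be sketched as follows. This is a toy illustration under stated assumptions: activations are summarized as one mean vector per checkpoint, trait directions are given, and the per-trait threshold rule in `flag_checkpoint` is a stand-in for the paper's monitor, whose actual construction the abstract does not describe. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def unit(v: Vector) -> List[float]:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("trait direction must be nonzero")
    return [c / norm for c in v]


def drift_profile(base: Vector, checkpoint: Vector,
                  traits: Dict[str, Vector]) -> Dict[str, float]:
    """Project the drift (checkpoint - base) onto each unit trait direction."""
    drift = [c - b for b, c in zip(base, checkpoint)]
    return {
        name: sum(d * u for d, u in zip(drift, unit(direction)))
        for name, direction in traits.items()
    }


def flag_checkpoint(profile: Dict[str, float], limits: Dict[str, float]) -> bool:
    """Assumed rule: flag when any trait coordinate exceeds its calibrated limit."""
    return any(abs(profile[name]) > limits[name] for name in limits)


profile = drift_profile([0.0, 0.0], [3.0, 4.0], {"a": [1.0, 0.0], "b": [0.0, 2.0]})
```

Because the profile has one number per trait, evaluating it at every checkpoint is cheap compared with repeated behavioral evaluation, which is the trade-off the abstract emphasizes.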

2606.07630 2026-06-09 cs.LG cs.AI stat.ML New submission

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

Affiliations: University of California, Riverside; Carnegie Mellon University; Worcester Polytechnic Institute

AI summary: Addresses class imbalance and noisy annotations in real-world data with an active learning framework that leverages foundation model priors, selecting the most informative samples through imbalance-aware co-decisions and achieving over 50% annotation savings on image and text datasets.

Comments
To appear at ICML 2026

Abstract

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.

2606.07627 2026-06-09 cs.LG math.AT math.CT New submission

Learning Transfers: Kan Extensions for Neural Invariants

Luciano Melodia

Affiliations: Friedrich-Alexander Universität Erlangen-Nürnberg

AI summary: Formalizes the structural invariant in transfer learning using Kan extensions from category theory, defines a transfer discrepancy measure, gives finite cokernel formulas in chain complexes and persistence modules, and computes the discrepancy for persistence-valued invariants via bottleneck distance; experiments test whether the method identifies the correct task functor and detects representation collapses that destroy transfer-relevant topology.

Abstract

Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. Standard evaluations probe this through target accuracy or distributional discrepancy, yet leave unspecified which structural invariant is meant to transfer. We supply that invariant categorically. A source task category $\mathcal A$, a target task category $\mathcal B$, and a task-change functor $J:\mathcal A\to\mathcal B$ determine, for every invariant-valued source representation $F:\mathcal A\to\mathcal V$, the universal transferred invariant $\operatorname{Lan}_J F$. Given a target invariant $G:\mathcal B\to\mathcal V$, we define the transfer discrepancy $\operatorname{Comp}_J(F,G)=\sup_{b\in\operatorname{Ob}(\mathcal B)} d_{\mathcal V}\bigl((\operatorname{Lan}_J F)(b),G(b)\bigr)$, evaluating transfer not by an objectwise comparison of source and target, but by comparing the target invariant against the one forced by the prescribed task transformation. We prove finite cokernel formulas for $(\operatorname{Lan}_J F)(b)$ in chain complexes and persistence modules, indexed by the comma category $J\downarrow b$. For persistence-valued finite-type one-parameter invariants, the discrepancy is computed exactly by bottleneck distances between barcodes. Controlled experiments on neural latent point clouds then test whether the score recovers the correct task functor and flags representation collapses that preserve classification accuracy while destroying transfer-relevant topology.
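
For orientation, two textbook formulas sit behind the objects named in this abstract. They are standard facts, stated here under the assumption that the relevant colimits exist in $\mathcal V$, and are not the paper's finite cokernel presentation: the pointwise formula for a left Kan extension as a colimit over the comma category, and the bottleneck distance between persistence diagrams $X$ and $Y$, where $\gamma$ ranges over bijections between the diagrams after adding diagonal points.

```latex
(\operatorname{Lan}_J F)(b) \;\cong\;
  \operatorname*{colim}_{(a,\; f : J a \to b)\,\in\, J\downarrow b} F(a),
\qquad
d_B(X, Y) \;=\; \inf_{\gamma : X \to Y}\; \sup_{x \in X}\, \lVert x - \gamma(x) \rVert_\infty .
```

Read together, the first formula says the transferred invariant at a target task $b$ is assembled from all source tasks that map to $b$, and the second gives a concrete choice of $d_{\mathcal V}$ in the persistence-valued case.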

2606.07626 2026-06-09 cs.CV cs.AI New submission

Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic

Pranav Darshan, Raghuveer Narayanan Rajesh, M Uttara Kumari

Affiliations: RV College of Engineering

AI summary: Targets perception in unstructured urban traffic with a 360-degree LiDAR perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions, and validates multi-class detection performance on an Indian urban traffic dataset.

Abstract

Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users, frequent occlusions, irregular motion patterns, and the lack of standardized road layouts. Although recent LiDAR-based 3D object detectors have shown strong performance in structured driving scenarios, most are developed and evaluated for limited field-of-view settings, and their behavior under full-surround 360-degree sensing is still not well understood. This paper studies a 360-degree LiDAR perception pipeline for autonomous driving, with particular attention to panoramic sensing, azimuthal sector-wise spatial processing, and transformation-equivariant feature extraction in complex urban scenes. The paper presents a practical 360-degree perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions and evaluates its behavior on a custom Ouster OS0 LiDAR dataset collected across diverse Indian urban traffic conditions. The results show generally stable detection across several object classes, with the strongest performance for cars at 92.02/90.51, buses at 80.53/76.34, and trucks at 78.59/74.16, while lower scores for pedestrians at 67.45/61.02, cyclists at 73.21/69.54, and motorcyclists at 71.20/68.13 reflect the greater difficulty of detecting smaller and more variable road users in dense urban scenes.
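
Azimuthal sector-wise processing, as named in this abstract, starts from partitioning a 360-degree sweep by horizontal angle. The sketch below shows one simple way to do that; the equal-width sectors, the default of eight sectors, and the function name `split_into_sectors` are assumptions for illustration, since the abstract does not give the paper's partitioning scheme.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the sensor frame


def split_into_sectors(points: List[Point], num_sectors: int = 8) -> Dict[int, List[Point]]:
    """Group points into equal-width azimuth sectors covering the full circle."""
    if num_sectors <= 0:
        raise ValueError("num_sectors must be positive")
    width = 2.0 * math.pi / num_sectors
    sectors: Dict[int, List[Point]] = {i: [] for i in range(num_sectors)}
    for x, y, z in points:
        azimuth = math.atan2(y, x) % (2.0 * math.pi)  # map (-pi, pi] onto [0, 2*pi)
        # Clamp guards against the modulo rounding up to exactly 2*pi.
        index = min(int(azimuth / width), num_sectors - 1)
        sectors[index].append((x, y, z))
    return sectors


quadrants = split_into_sectors(
    [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)],
    num_sectors=4,
)
```

Each sector can then be processed independently, which is one reason sector-wise designs are attractive for panoramic sensors; objects straddling a sector boundary need extra handling that this sketch leaves out.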

2606.07624 2026-06-09 cs.LG New submission

Sequential statistical inference for Large Language Models: Representation, validity, and monitoring

Yao Xie

Affiliations: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: Proposes applying sequential statistical inference to LLM trustworthiness, organized around three tasks (representation, validity, and monitoring): LLM interactions are treated as dependent stochastic processes, with uncertainty guarantees and detection of behavioral changes.

Comments
This article was prepared for an invited discussion in The American Statistician

Abstract

This discussion argues that sequential statistical inference can naturally contribute to LLM trustworthiness. In deployment, LLM systems are queried repeatedly, conditioned on evolving contexts, and incorporate user or tool feedback, and may exhibit behavioral shifts after model updates or distribution changes. The discussion is organized around three tasks: representation, modeling LLM interactions as dependent stochastic processes rather than isolated prompt--response pairs; validity, developing uncertainty guarantees that remain meaningful under dependence, repeated use, and adaptation; and monitoring, using sequential alarms and change-point detection to identify shifts in calibration, hallucination rates, refusal behavior, fairness, or other task-relevant properties. This perspective complements recent surveys by viewing trustworthy LLM deployment as a problem of statistical process control.
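
The monitoring task can be made concrete with a classical sequential alarm. The sketch below is a one-sided Bernoulli CUSUM over a stream of 0/1 flags (for example, 1 when a response is judged to be a hallucination). CUSUM is chosen here only as a familiar statistical process control tool; the abstract does not name a specific procedure, and this version assumes independent flags, which is exactly the simplification the abstract warns about. The function name and parameters are hypothetical.

```python
import math
from typing import Iterable, Optional


def cusum_alarm(flags: Iterable[int], p0: float, p1: float,
                threshold: float) -> Optional[int]:
    """Return the 0-based index where the alarm fires, or None if it never does.

    p0 is the in-control flag rate, p1 > p0 the degraded rate to detect.
    """
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    step_up = math.log(p1 / p0)                # evidence added by a flagged response
    step_down = math.log((1 - p1) / (1 - p0))  # negative: evidence removed by a clean one
    stat = 0.0
    for index, flag in enumerate(flags):
        stat = max(0.0, stat + (step_up if flag else step_down))
        if stat >= threshold:
            return index
    return None


# 50 clean responses followed by a burst of flagged ones.
alarm_index = cusum_alarm([0] * 50 + [1] * 10, p0=0.05, p1=0.2, threshold=4.0)
```

The threshold trades detection delay against false alarms; handling dependent, adaptive interaction streams would require the kind of guarantees the discussion calls for rather than this independent-trials baseline.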

2606.07623 2026-06-09 cs.LG cs.LO New submission

Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models

Faruk Alpay, Hamdi Alakkad

Affiliations: Bahcesehir University

AI summary: Proposes verifying context-conditioned language-model behavior with finite semantic certificates, proves a determinacy criterion for finite-field linear task families, and proves an anti-mirage theorem for threshold emergence that separates thresholded metrics from semantic confidence.

Comments
40 pages; ancillary files provided

Abstract

This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in a context force the answer to a query without changing model parameters? In finite-field linear task families, we prove an exact row-space criterion, compute the residual hypothesis count, derive full and query-local identification curves, and show that extracting a smallest forcing subcontext is NP-complete even for binary outputs. The second problem is threshold emergence: when does an apparent benchmark jump reflect a semantic transition rather than a discontinuity of the scoring map? We prove an anti-mirage theorem separating thresholded metrics from semantic confidence and give a rate-sensitive crossing bound for latent commitments becoming visible above threshold. The common semantic object is a confidence functional on definable events. We show that it is a Boolean probability measure, equivalently a Keisler measure on the relevant type space, whose measure-one formulas form a proper filter and whose Stone-space representation is invariant under definitional expansion. The resulting calculus provides finite context certificates, pair-separator hitting sets, query teaching dimension, prompt-preservation criteria, and scale-limit witnesses. Exact-arithmetic ancillary scripts reproduce the finite-field and threshold calculations and generate the data used by the figures.
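
A small worked case may help with the finite determinacy question. Assume, purely for illustration, that tasks are linear functionals over GF(2): each hidden task is a bit vector w and the label of an input x is the parity of w AND x. Under that assumption the context forces a query's answer exactly when the query lies in the row space of the context inputs, and the number of tasks still consistent with the context is 2^(dim - rank). The sketch below checks both by Gaussian elimination on bit-packed rows; it is one reading of a row-space criterion, not the paper's construction, and every name in it is hypothetical.

```python
from typing import List, Optional, Tuple

Example = Tuple[int, int]  # (input bits packed into an int, label bit)


def xor_reduce(vec: int, basis: List[int]) -> int:
    """Reduce vec against an XOR basis whose members have distinct leading bits."""
    for b in basis:
        vec = min(vec, vec ^ b)  # clears b's leading bit from vec when it is set
    return vec


def build_basis(context: List[Example]) -> List[int]:
    """Row-reduce the augmented rows [x | y] over GF(2)."""
    basis: List[int] = []
    for x, y in context:
        row = xor_reduce((x << 1) | (y & 1), basis)
        if row == 1:
            raise ValueError("inconsistent context: zero input combined with label 1")
        if row:
            basis.append(row)
    return basis


def forced_answer(context: List[Example], query: int) -> Optional[int]:
    """Return the label forced by the context, or None if the query is undetermined."""
    rest = xor_reduce(query << 1, build_basis(context))
    return (rest & 1) if rest >> 1 == 0 else None


def residual_hypotheses(context: List[Example], dim: int) -> int:
    """Count the weight vectors in GF(2)^dim consistent with the context."""
    return 2 ** (dim - len(build_basis(context)))


# Hidden task w = 0b101: inputs 0b001 and 0b110 both have label 1.
context = [(0b001, 1), (0b110, 1)]
```

With this context the query 0b111 is the XOR of the two inputs, so its label is forced to 1 XOR 1 = 0, while 0b100 lies outside the row space and stays undetermined.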

2606.07622 2026-06-09 cs.LG stat.AP New submission

Airport Terminal Passenger Queue Forecasting for Departure Gates and Security Checkpoints

Juhwan Lee, Seokbin Yoon, Keumjin Lee, Hojong Baik, Seyeon Jung

Affiliations: Korea Aerospace University; Korea Airports Corporation

AI summary: Proposes a Transformer-based framework that uses historical queue length, waiting time, and passenger throughput data to forecast queue length and waiting time at departure gates and security checkpoints up to two hours ahead, supporting proactive queue management.

Comments
9 pages, 6 figures, accepted at DASC 2026

Abstract

Accurate passenger queue forecasting in airport terminals is essential for efficient departure operations, as it enables proactive congestion management. However, time-varying passenger demand and heterogeneous facility usage across multiple departure facilities make forecasting challenging. In this work, we propose a passenger queue forecasting framework that learns historical passenger flow patterns from operational data. The proposed model employs a Transformer-based architecture to capture temporal dependencies and inter-facility correlations using past queue length and waiting time at departure gates and security checkpoints, together with passenger throughput at check-in islands. The learned representations are mapped to two facility-specific MLP heads to predict queue length and waiting time at departure gates and security checkpoints. Experimental results demonstrate accurate forecasts up to two hours ahead. The proposed approach offers practical real-time decision support for proactive queue management and staff reallocation in airport terminal operations.

2606.07621 2026-06-09 cs.LG cs.AI cs.DC New submission

HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning

Amir Hossein Shahdadian, Ahmed M. Abdelmoniem, Mahdi Taheri, Samira Nazari, Christian Herglotz

Affiliations: University of Naples "Federico II"; Queen Mary University of London; Brandenburg University of Technology Cottbus-Senftenberg; Tallinn University of Technology; University of Zanjan

AI summary: Proposes HASA, which assigns subnet widths according to client heterogeneity scores and improves mean and worst-client accuracy under a fixed compute budget.

Abstract

Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local. In practice, deployments must handle heterogeneity in both client resources and local data distributions. Model-heterogeneous federated learning lowers client cost by allowing each client to train a subnet of a shared supernet, but most subnet-allocation policies are driven by device constraints and do not explicitly account for statistical heterogeneity. This paper proposes Heterogeneity-Aware Subnet Allocation (HASA), a train-only rule that assigns subnet widths based on client heterogeneity scores computed from local training data while enforcing a fixed size-weighted compute budget. This design enables budget-matched comparisons with alternative allocation policies. On an article-title next-word prediction benchmark with seven clients, HASA improves unweighted mean client test accuracy over uniform allocation across 10 matched seeds, increasing mean client test accuracy from 13.82 percent to 14.32 percent, and improves worst-client accuracy on average. In a matched-budget comparison with representative partial-training baselines, HASA achieves the strongest worst-client and tail-client accuracy on this benchmark. A directionality ablation shows that assigning smaller subnets to more heterogeneous clients degrades both mean and tail performance. A cross-domain image-classification study further shows that the effectiveness of heterogeneity-aware allocation depends on how well the heterogeneity score reflects clients' need for additional model width.

2606.07620 2026-06-09 cs.CV cs.AI cs.DC cs.LG 新提交

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

SENTRY: 视觉Transformer在软错误下的统计可靠性分析

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

发表机构 * Brandenburg University of Technology Cottbus-Senftenberg(勃兰登堡工业大学) Tallinn University of Technology(塔林理工大学) Zanjan University(赞詹大学)

AI总结 提出基于有限总体抽样的统计故障注入框架,仅需数千样本即可在99%置信度下以1%误差界估计故障率,将实验成本降低高达10700倍,并揭示ViT中归一化层和关键指数位是脆弱性热点。

详情
AI中文摘要

随着视觉Transformer在自动驾驶和医学成像等安全关键领域的应用增长,确保其抵抗软错误的可靠性至关重要。尽管ViT提供了最先进的准确性,但其庞大的参数数量使得穷举故障注入不可行。为弥补这一差距,本文提出一个统计故障注入框架,利用有限总体抽样理论提供形式化的可靠性保证。我们证明,无论模型规模如何,仅需数千个样本即可在99%置信度下将故障率限制在1%的误差界内。与穷举方法相比,该方法将实验成本降低高达10700倍,同时保留跨架构组件定位脆弱性的能力。通过对ViT-Tiny和ViT-Small等不同架构的广泛评估,我们揭示了高度非均匀的可靠性景观。结果表明,虽然只有3%的FP32位翻转导致故障,但其中绝大多数事件导致灾难性的精度崩溃。具体脆弱性被定位到归一化层和IEEE-754格式中的关键指数位,为设计加固的、边缘部署的ViT架构提供了数学基础和可操作的见解。

英文摘要

With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700-fold reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
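To make the sampling argument concrete, here is a minimal sketch of a common finite-population sample-size formula from survey sampling. Whether the paper uses exactly this formula is an assumption; the function name `sample_size`, the assumed failure probability `p`, and the parameter counts are illustrative, not taken from the paper.

```python
import math
from statistics import NormalDist

def sample_size(population, margin=0.01, confidence=0.99, p=0.5):
    """Fault injections needed to estimate a failure rate within +/- margin.

    population: number of possible faults (finite population size N)
    p: assumed failure probability; 0.5 is the worst case (largest sample)
    """
    # Two-sided critical value of the standard normal distribution.
    t = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    variance_term = t * t * p * (1.0 - p)
    n = population / (1.0 + margin * margin * (population - 1) / variance_term)
    return min(population, math.ceil(n))

# Illustrative parameter counts (assumed); each FP32 weight contributes 32 bits.
for params in (5_700_000, 22_000_000, 86_000_000):
    faults = params * 32
    print(params, sample_size(faults), sample_size(faults, p=0.03))
```

The required sample size flattens out as the population grows, which is why the cost of a statistical campaign is nearly independent of model scale, and a smaller assumed `p` lowers it further.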

2606.07618 2026-06-09 cs.LG cs.AI cs.CV 新提交

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

ScaleSweep: 通过块尺度初始化实现LLM的精确NVFP4训练后量化

Li Lin, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所)

AI总结 提出ScaleSweep方法,通过扫描可行块尺度候选并选择最小化目标函数的候选,优化NVFP4量化中的尺度初始化,理论推导扫描范围边界,在Llama和Qwen模型上提升量化性能,缩小与全精度的差距。

详情
Comments
under review
AI中文摘要

NVFP4是一种最近引入的硬件支持的FP4格式,通过细粒度块尺度提高了4位量化的保真度。然而,现有的NVFP4尺度初始化方法仍然主要依赖于AbsMax初始化,这与最优解之间存在明显差距。为了解决这个问题,我们提出了ScaleSweep,一种简单高效的尺度优化方法,它扫描可行的块尺度候选,并选择最小化目标函数的候选。我们进一步提供了NVFP4量化的理论分析,并推导了在原始张量与量化重建张量之间的均方误差(MSE)和加权均方误差(WMSE)下所需扫描范围的上下界。所提出的界限大幅减少了扫描空间,同时保留了最优候选,使得与基线量化算子相比开销可忽略。在Llama和Qwen模型上的实验表明,ScaleSweep持续优于现有的初始化方法,并进一步缩小了与全精度的差距。特别是在对权重、激活、KV缓存和查询状态进行激进的全端到端量化时,ScaleSweep保留了超过93%的全精度性能。

英文摘要

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, so the added overhead is negligible compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
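The idea of sweeping block scales can be sketched as follows. This is a simplified illustration, not the authors' implementation: it keeps the scale in full precision (real NVFP4 stores block scales in a low-precision format), uses a fixed grid of multipliers instead of the derived sweep bounds, and the names `fake_quantize` and `sweep_scale` are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quantize(block, scale):
    """Round each value to the nearest scaled E2M1 magnitude, keeping the sign."""
    if scale <= 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def absmax_scale(block):
    """AbsMax initialization: map the largest magnitude to the top grid value."""
    return max(abs(x) for x in block) / E2M1_GRID[-1]

def sweep_scale(block, factors=tuple(i / 20 for i in range(10, 31))):
    """Try multiples of the AbsMax scale (0.5x to 1.5x) and keep the lowest-MSE one."""
    base = absmax_scale(block)
    candidates = [base * f for f in factors]
    return min(candidates, key=lambda s: mse(block, fake_quantize(block, s)))

# One block of 16 values (made-up data).
block = [0.03, -0.41, 0.12, 0.95, -0.07, 0.33, -0.58, 0.21,
         0.02, -0.16, 0.49, -0.88, 0.27, -0.05, 0.64, -0.30]
s_absmax = absmax_scale(block)
s_sweep = sweep_scale(block)
print(mse(block, fake_quantize(block, s_absmax)),
      mse(block, fake_quantize(block, s_sweep)))
```

Because the multiplier 1.0 is in the grid, the swept scale can never do worse than AbsMax on the chosen objective; the paper's bounds address how narrow the candidate range can be made without losing the optimum.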

2606.07614 2026-06-09 cs.LG stat.AP 新提交

Measuring Poverty and Inequality with Reduced Data: A Machine Learning Approach Using Nigerian Household Data

用缩减数据衡量贫困与不平等:基于尼日利亚住户数据的机器学习方法

Vanesa Jordá, Miguel Niño-Zarazúa

发表机构 * Cantabria University(坎塔布里亚大学) SOAS University of London(伦敦大学亚非学院) United Nations University World Institute for Development Economics Research (UNU-WIDER)(联合国大学世界发展经济学研究所)

AI总结 本文利用随机森林递归特征消除法分析尼日利亚调查数据,发现少量预测因子即可高精度识别贫困状态和不平等线位置,表明机器学习可优化调查设计并降低数据需求。

详情
AI中文摘要

可靠衡量收入和消费对于监测中低收入国家的贫困与不平等至关重要,但完整的住户调查成本高昂且难以定期实施。本文探讨缩减调查工具能否保留关键分布信息。我们应用随机森林递归特征消除法(RF-RFE)对2018/19年尼日利亚通用住户调查面板数据进行分析,识别最能将个体划分到福利分布中的收入来源、消费类别和住户特征。分析聚焦三个结果:贫困状态、在五等分分布中的位置以及相对于基于基尼系数的不平等线的位置。调查的种植后和收获后阶段使我们能够评估不同季节背景下的表现。结果表明,RF-RFE在少量预测因子下实现了强分类准确率。对于消费,使用少量支出类别即可准确预测贫困状态和不平等线位置,而五等分分类对季节性消费达到约80%的准确率,对从单次季节性访问预测的年消费达到60-65%的准确率。对于收入,使用五个预测因子贫困状态准确率约达90%,不平等线位置主要由劳动收入捕获。研究结果表明,机器学习方法有助于改进调查设计并减少数据需求,同时保留衡量和监测贫困与不平等所需的大部分分布信息。

英文摘要

Reliable measurement of income and consumption is essential for monitoring poverty and inequality in low- and middle-income countries, yet full household surveys are costly and difficult to implement regularly. This paper examines whether reduced survey instruments can preserve key distributional information. We apply Random Forest Recursive Feature Elimination (RF-RFE) to the 2018/19 Nigeria General Household Survey-Panel to identify the income sources, consumption categories and household characteristics that best classify individuals within the welfare distribution. The analysis focuses on three outcomes: poverty status, location in the quintile distribution and position relative to the Gini-based inequality line. The survey's post-planting and post-harvest periods allow us to assess performance under different seasonal contexts. Results show that RF-RFE achieves strong classification accuracy with few predictors. For consumption, poverty status and inequality-line position are accurately predicted using a small set of expenditure categories, while quintile classification reaches about 80 percent accuracy for seasonal consumption and 60 to 65 percent for annual consumption predicted from a single seasonal visit. For income, poverty status reaches around 90 percent accuracy with five predictors, and inequality-line position is largely captured by labour earnings. The findings suggest that machine-learning methods can help improve survey design and reduce data requirements while retaining much of the distributional information needed to measure and monitor poverty and inequality.

2606.07610 2026-06-09 cs.LG cs.AI cs.CL 新提交

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF: 无需分支的树生长方法用于语音感知大语言模型后训练

Argyrios Gerogiannis, Yekaterina Yegorova, Mark Hasegawa-Johnson, Venugopal V. Veeravalli

发表机构 * University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对语音感知大语言模型后训练中GRPO方法粗粒度信用分配问题,提出LEAF方法,通过回溯式树结构学习、高信息量边界选择和跨度级优势分配,在语音问答和翻译任务上超越GRPO。

详情
Comments
15 pages, 3 figures, 11 tables
AI中文摘要

最先进的GRPO风格方法在语音感知大语言模型后训练中存在粗粒度信用分配问题,将相同的终端奖励优势广播给响应中的每个token。这忽略了rollout批次中的有用结构,其中语音条件下的补全通常共享前缀,然后在重要决策处出现分歧。我们提出低秩探索自适应分叉(LEAF),一种基于回溯树的强化学习方法,无需在线分支或额外解码即可恢复这种结构。LEAF采样完整响应,选择高信息量边界,按共享前缀分组响应,并使用后代奖励分配跨度级优势。我们从理论上证明了LEAF的跨度级信用分配和边界选择设计。实验上,在相同的rollout和低秩适应预算下,LEAF在语音问答和语音翻译基准上优于GRPO。值得注意的是,较小的LEAF训练模型优于当前最先进的完全参数基线。

英文摘要

State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful structure within rollout batches, where speech-conditioned completions often share prefixes before diverging at important decisions. We propose Low-rank Exploration with Adaptive Forking (LEAF), a retrospective tree-based RL method that recovers this structure without online branching or additional decoding. LEAF samples complete responses, selects high-surprisal boundaries, groups responses by shared prefixes, and assigns span-level advantages using descendant rewards. We theoretically justify LEAF's span-level credit assignment and boundary-selection design. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget. Notably, smaller LEAF-trained models outperform current state-of-the-art, full-parameter baselines.
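The retrospective grouping step can be illustrated with a one-boundary toy version. This is a sketch under strong assumptions, not LEAF itself: the boundary position is given rather than chosen by surprisal, only a single tree level is built, advantages are not normalized, and the function name `span_advantages` is hypothetical.

```python
from collections import defaultdict

def span_advantages(responses, rewards, boundary):
    """Return (prefix_advantage, suffix_advantage) for each sampled response.

    Responses sharing the same first `boundary` tokens form one group.
    The prefix span is credited with the group's mean reward relative to the
    batch mean; the suffix span with the response's reward relative to its group.
    """
    global_mean = sum(rewards) / len(rewards)
    groups = defaultdict(list)
    for idx, tokens in enumerate(responses):
        groups[tuple(tokens[:boundary])].append(idx)
    result = [None] * len(responses)
    for members in groups.values():
        group_mean = sum(rewards[i] for i in members) / len(members)
        for i in members:
            result[i] = (group_mean - global_mean, rewards[i] - group_mean)
    return result

# Made-up rollouts: two shared prefixes, each with two continuations.
responses = [["the", "cat", "sat"], ["the", "cat", "ran"],
             ["a", "dog", "sat"], ["a", "dog", "ran"]]
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = span_advantages(responses, rewards, boundary=2)
print(advantages)
```

In this toy version the two span advantages of a response sum to its reward minus the batch mean, so the sequence-level signal is split across spans rather than broadcast unchanged to every token.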

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD 新提交

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语语音识别的Whisper字幕对齐微调:基准污染、惯例不匹配以及25.6% WER(13.8% cWER)的诚实基线

Felix Akeret

发表机构 * Independent Researcher, Zurich, Switzerland(独立研究员,瑞士苏黎世) ETH Zürich(苏黎世联邦理工学院) University of Bern(伯尔尼大学) FHNW(西北应用科学与艺术大学) CeTIM Leiden/Munich(CeTIM 莱顿/慕尼黑)

AI总结 通过1,367小时广播语音与标准德语字幕的弱监督,系统微调Whisper large-v3用于瑞士德语语音识别,发现公开结果因基准污染被高估,并发布两个诚实评估的模型。

详情
Comments
15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI
AI中文摘要

我们提出了一项系统研究,针对OpenAI的Whisper large-v3进行微调,用于瑞士德语语音识别,使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark(Grace Blackwell,128 GB统一内存,最高1 PFLOP FP4)上进行16次迭代训练,我们比较了LoRA和全微调(1.55B参数模型),研究了幻觉的根本原因,并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中,在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异(时态、词序、瑞士正字法)分离的协调错误分析,得到内容WER (cWER)为13.8%,仅计算实际识别失败。偏差校正估计将其降至8.5%,表明真实错误率约为测量WER的三分之一。我们证明,已发表的瑞士德语ASR最先进结果(17.1-17.5% WER)因基准污染而被夸大:一个在ASGDTS测试集上自训练的普通Whisper模型(零瑞士德语数据)实现了13.88% WER,超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应(3.9% WER),揭示该基准主要衡量惯例匹配而非方言理解。我们发布了两个模型,一个LoRA适配器(25.32% WER,13.9% cWER)和一个全微调模型(25.60% WER,13.8% cWER),这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一,采用Apache 2.0许可,完全可复现,无需机构数据协议。

英文摘要

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
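For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below shows the standard computation; the example sentences are made up, and the paper's cWER additionally excludes valid stylistic variants through a harmonization step that is not reproduced here.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# A tense change counts as a full word error under plain WER.
print(wer("wir gehen heute ins kino", "wir gingen heute ins kino"))
```

The example illustrates why measured WER can overstate recognition failures when reference and hypothesis follow different but equally valid conventions.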

2606.07606 2026-06-09 cs.LG 新提交

QDSP: An Interpretable Structured Learning Framework for Predicting Death or Cerebral Palsy in Very Low Birth Weight Infants

QDSP:一种用于预测极低出生体重婴儿死亡或脑瘫的可解释结构化学习框架

Ling Wang, Xiaolong Li, Hui Zhou, Jing Shi, Fuhao Zhang, Dapeng Chen, Nan Mu

发表机构 * College of Computer Science, Sichuan Normal University(四川师范大学计算机科学学院) West China Second University Hospital, Sichuan University(四川大学华西第二医院)

AI总结 提出QDSP框架,集成配额引导子空间采样和可微决策结构感知,在极低出生体重婴儿队列中实现高精度死亡/脑瘫预测,并提供可解释的临床决策路径。

详情
AI中文摘要

极低出生体重婴儿(VLBWI)面临高死亡风险和严重神经发育障碍(包括脑瘫),但在高维且数据有限的临床环境中,可靠的出院时预后分层仍然具有挑战性。为解决此问题,我们提出QDSP,一种可解释的结构化学习框架,集成配额引导子空间采样(QSS)和可微决策引导结构感知(DSP)。QSS模块通过基于自助法的特征一致性估计构建稳定性感知且低冗余的特征子空间,而DSP模块采用可微软斜决策结构建模非线性临床交互,同时保留可追溯的决策证据。该框架在包含51名婴儿的真实VLBWI队列上评估,并在三个公共医学表格数据集上进一步验证。在主要队列上,QDSP达到0.9200的准确率和0.9714的AUC,优于代表性机器学习和深度表格学习基线,包括XGBoost、TabNet和TabPFN。在外部数据集上,QDSP在不同样本量和临床分布下保持有竞争力的判别力和校准度。此外,基于SHAP的分析和可微决策路径追踪识别出临床相关预测因子,包括囊性脑室周围白质软化(cPVL)和出生体重,与已建立的新生儿病理生理学证据一致。这些结果表明,QDSP为VLBWI出院时风险分层提供了可解释且稳健的框架,并可能支持新生儿重症监护环境中的早期个体化临床决策。

英文摘要

Very low birth weight infants (VLBWI) are at high risk of mortality and severe neurodevelopmental impairment, including cerebral palsy, yet reliable discharge-time prognostic stratification remains challenging in high-dimensional and data-limited clinical settings. To address this problem, we propose QDSP, an interpretable structured learning framework that integrates Quota-guided Subspace Sampling (QSS) and Differentiable-decision-guided Structure Perception (DSP). The QSS module constructs stability-aware and low-redundancy feature subspaces through bootstrap-based feature consistency estimation, whereas the DSP module employs differentiable soft oblique decision structures to model nonlinear clinical interactions while preserving traceable decision evidence. The proposed framework was evaluated on a real-world VLBWI cohort comprising 51 infants and further validated on three public medical tabular datasets. On the primary cohort, QDSP achieved an accuracy of 0.9200 and an AUC of 0.9714, outperforming representative machine learning and deep tabular learning baselines, including XGBoost, TabNet, and TabPFN. Across external datasets, QDSP maintained competitive discrimination and calibration under varying sample sizes and clinical distributions. In addition, SHAP-based analyses and differentiable decision-path tracing identified clinically relevant predictors, including cystic periventricular leukomalacia (cPVL) and birth weight, consistent with established neonatal pathophysiological evidence. These results suggest that QDSP provides an interpretable and robust framework for discharge-time risk stratification in VLBWI and may support early individualized clinical decision-making in neonatal intensive care settings.

2606.07603 2026-06-09 cs.LG cs.AI 新提交

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

MetaEvo:一种基于经验驱动的智能体进化的元优化框架

Bowen Ren, Heyan Huang, Yinghao Li, Yang Gao

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Beijing Institute of Technology Southeast Academy of Information Technology(北京理工大学东南信息技术研究院)

AI总结 提出MetaEvo两阶段框架,通过偏好优化增强模型从任务经验中抽象原则的能力,并在模块化架构中积累复用,持续提升推理性能。

详情
AI中文摘要

大型语言模型(LLM)展现出强大的推理能力,但大多数基于LLM的智能体是静态部署的,无法通过任务交互进行改进。现有的经验驱动方法通常依赖于记忆或启发式方法,而不增强模型的学习能力,将其视为被动执行者,导致早期性能平台和有限的长期改进。为了解决这个问题,我们提出了MetaEvo,一个用于持续智能体进化的两阶段框架,专注于改进模型如何从任务经验中学习,而不仅仅是存储什么。MetaEvo首先应用基于偏好的优化来增强模型的原则抽象能力,然后在模块化智能体架构中实现这些原则的积累和重用。在多样化推理基准上的实验结果表明,MetaEvo始终优于强基线,并在迭代中保持可靠的改进。这些发现验证了元优化在使智能体从经验中学习并持续增强其推理能力方面的有效性。

英文摘要

Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or heuristics without enhancing the model's ability to learn, treating it as a passive executor and leading to early performance plateaus and limited long-term improvement. To address this issue, we propose MetaEvo, a two-stage framework for continual agent evolution that focuses on improving how the model learns from task experience, rather than solely on what it stores. MetaEvo first applies preference-based optimization to enhance the model's ability to abstract principles, then enables the accumulation and reuse of these principles within a modular agent architecture. Experimental results on diverse reasoning benchmarks demonstrate that MetaEvo consistently outperforms strong baselines and maintains reliable improvement across iterations. These findings validate the effectiveness of meta-optimization in enabling agents to learn from experience and continually enhance their reasoning capabilities.

2606.07602 2026-06-09 cs.LG cs.AI 新提交

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

面向LEGO空间物理推理的样本高效后训练

Yuhuan Yuan, Zhouliang Yu, Minghao Liu, Weiyang Liu, Ge Lin Kan

发表机构 * HKUST(GZ)(香港科技大学(广州)) CUHK(香港中文大学) ZODA

AI总结 针对LLM生成LEGO组装时出现的物理有效但几何语义错位问题,提出基于模型的数据选择方法和样本高效强化学习PVPO,结合体素空间几何奖励,提升结构、语义对齐和物理有效性。

详情
Comments
Technical Report V1, 15 pages, 6 figures, 3 tables
AI中文摘要

基于LLM的LEGO组装生成需要同时具备语义基础和物理可行性。我们发现一种数据引发的失败模式PhysHack,其中组装满足物理有效性约束,但产生的结构在几何上错位、语义上不一致或校准不良。为应对这一挑战,我们提出一种基于模型的数据选择方法,仅使用一小部分训练数据,同时改进基于物理的LEGO组装生成。基于所选轨迹,我们引入PVPO,一种样本高效的强化学习方法,将物理可行性与体素空间几何奖励相结合。我们的结果表明,仅物理有效性不足以作为可靠物理推理的代理:模型可以学习生成有效结构而不保持语义或几何保真度。跨模型主干和测试时缩放设置的实验表明,PVPO改善了结构和语义对齐、物理有效性、结构稳定性和校准,同时减少了对大量事后拒绝采样的依赖。特别是,校准结果表明,PVPO通过使测试时选择更能预测语义和结构质量来缓解PhysHack。

英文摘要

LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

2606.07600 2026-06-09 cs.LG cs.AI 新提交

Reachability and asymptotics of Gaussian Transformer dynamics

高斯Transformer动力学的可达性与渐近性

Albert Alcalde, Zhengping Ji, Enrique Zuazua

发表机构 * Friedrich–Alexander University Erlangen–Nürnberg(弗里德里希-亚历山大大学埃尔朗根-纽伦堡) Research Council of Norway(挪威研究理事会)

AI总结 将Transformer数据传播建模为概率测度空间上的非线性控制系统,证明高斯分布在自注意力与仿射前馈层下保持高斯性,从而降维为双线性控制系统,并揭示与Riccati方程的联系。

详情
AI中文摘要

我们将通过Transformer(驱动大型语言模型的机器学习架构)的数据传播建模为概率测度空间上的非线性控制系统。对于具有自注意力和仿射前馈层的平均场Transformer模型,我们证明高斯分布在诱导流下保持严格高斯性。这种不变性将无限维测度动力学简化为控制均值和协方差演化的有限维双线性控制系统,将Transformer的表达能力重新表述为关于指定高斯矩的可达性问题,并揭示了与经典滤波和控制中Riccati型方程的新联系。对于时变控制,我们证明任何目标高斯分布(其协方差矩阵与初始协方差矩阵具有相同秩)的精确有限时间可达性,该秩约束是动力学的一个内在不变量。对于时不变参数,我们推导出显式的谱条件,这些条件要么导致正定平衡点的渐近稳定性,要么导致协方差的有限时间爆破。数值实验补充了理论,表明具有高斯输入的实际Transformer在早期和中间层保持与矩匹配的高斯分布接近,而具有指定注意力矩阵的Transformer再现了预测的协方差状态:在稳定配置中有界演化,在失稳配置中爆破。

英文摘要

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures. For the mean-field Transformer model with self-attention and affine feed-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow. This invariance reduces the infinite-dimensional measure dynamics to a finite-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati-type equations from classical filtering and control. For time-varying controls, we prove exact finite-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics. For time-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive-definite equilibria or to finite-time blow-up of the covariance. Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow-up in destabilizing ones.

2606.07599 2026-06-09 cs.LG cs.AI cs.CV 新提交

DiffOR: A Unified Continuous Generative Framework for Universal Ordinal Regression

DiffOR:一种统一的连续生成框架用于通用序数回归

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Kuaishou Technology(快手科技) Shanghai University of Finance and Economics(上海财经大学) Tongji University(同济大学)

AI总结 提出DiffOR框架,将序数回归建模为连续生成任务,利用扩散模型通过迭代去噪恢复连续序数值,并设计双解耦策略(多尺度增量聚合与动态去噪感知)保留序数拓扑,在12个基准上超越现有方法。

详情
Comments
Accepted at KDD 2026
AI中文摘要

序数回归(OR)旨在预测具有内在顺序的目标值,支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成,现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分,无法捕捉序数数据固有的非平稳语义转换。在本文中,我们提出了一种新范式,将OR形式化为连续生成序数回归任务。在该新范式下,我们引入了DiffOR,一个统一的框架,利用扩散模型通过迭代去噪恢复连续序数值,从而能够动态学习软语义转换。为了显式保留序数拓扑,我们设计了一种双解耦策略:在空间上,多尺度增量聚合将目标分解为层次化的连续增量;在时间上,动态去噪感知将去噪步骤与特征频率同步,确保稳健的从粗到细的细化。理论上,我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性,建立了一个新标准,展示了作为通用序数回归通用解决方案的强大潜力。

英文摘要

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Although they have evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm in which OR is formulated as a Continuous Generative Ordinal Regression task. Under this paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07598 2026-06-09 cs.LG cs.AI 新提交

A Topological Characterization of Graph Neural Networks via Stochastic Block Model Embeddings on the n-Sphere

图神经网络的拓扑特征化:通过n-球面上的随机块模型嵌入

Gopal Anantharaman

发表机构 * KnotTheory.ai Inc.(KnotTheory.ai 公司) Dept. of Mathematics, Emporia State University(恩波利亚州立大学数学系)

AI总结 提出将消息传递神经网络诱导的随机块模型映射到单位n-球面的拓扑框架,用于比较训练后的图神经网络,并实现无需重新训练的迁移学习候选检索。

详情
AI中文摘要

我们提出一个拓扑框架,用于比较训练后的图神经网络(GNN),通过将消息传递神经网络(MPNN)在图信号空间上诱导的随机块模型(SBM)映射到单位$n$-球面$\mathbb{S}^{n-1}\subset\mathbb{R}^n$上。该构建基于三个经典支柱:割距离图空间$(\widetilde{\mathcal{W}}_0,\delta_\square)$的紧性\citep{lovasz2006limits,lovasz2012large},Frieze-Kannan弱正则引理及其由\citet{levie2023graphon}推广的图信号扩展,以及MPNN关于割距离的Lipschitz连续性。我们证明,对于任意给定的容差$\varepsilon>0$,一个训练后的MPNN $\Phi$作用于足够大的图时,可以通过一个复杂度有界的阶梯图信号(误差不超过$\varepsilon$)来分解,并且我们构造了一个显式的保测映射$\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$,将SBM区域放置在不相交的球冠上。这产生了一个与问题无关的低维训练GNN“指纹”,便于视觉检查和跨模型库的最近邻搜索,从而实现无需重新训练的迁移学习候选检索。我们讨论了高维中测度集中现象带来的障碍,这一现象与大规模语言模型规模的嵌入直接相关。最后,我们提出五个具体的未来研究方向:双曲和格拉斯曼流形替代球面模型,基于图信号的Gromov-Wasserstein距离作为$n$-球面映射的无等距替代,SBM流形的信息几何(Fisher)重新表述,逐层嵌入云的持续同调指纹,以及基于图信号特征分解的谱距离基线。

英文摘要

We propose a topological framework for comparing trained Graph Neural Networks (GNNs) by mapping the Stochastic Block Models (SBMs) induced on the graphon-signal space of a Message Passing Neural Network (MPNN) onto the unit $n$-sphere $\mathbb{S}^{n-1}\subset\mathbb{R}^n$. The construction rests on three classical pillars: the compactness of the cut-distance graphon space $(\widetilde{\mathcal{W}}_0,\delta_\square)$ \citep{lovasz2006limits,lovasz2012large}, the Frieze-Kannan weak regularity lemma together with its graphon-signal extension due to \citet{levie2023graphon}, and the Lipschitz continuity of MPNNs with respect to the cut-distance. We show that, for any prescribed tolerance $\varepsilon>0$, a trained MPNN $\Phi$ acting on a sufficiently large graph factors (up to $\varepsilon$) through a step-graphon-signal of bounded complexity, and we construct an explicit measure-preserving map $\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$ that places the SBM regions on disjoint spherical caps. This produces a problem-agnostic, low-dimensional "fingerprint" of a trained GNN that is amenable to visual inspection and to nearest-neighbour search across model zoos, enabling transfer-learning candidate retrieval without retraining. We discuss the obstruction posed by concentration of measure in high dimension, a phenomenon directly relevant to LLM-scale embeddings. We close with five concrete future research directions: hyperbolic and Grassmannian alternatives to the spherical model, Gromov-Wasserstein distances on graphon-signals as an isometry-free alternative to the $n$-sphere map, an information-geometric (Fisher) reformulation of the SBM manifold, persistent-homology fingerprints of layer-wise embedding clouds, and a spectral-distance baseline derived from the graphon eigendecomposition.

2606.07597 2026-06-09 cs.LG cs.AI 新提交

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

重复不匹配:为什么数据混合实验无法扩展以及如何修复

Kevin Zhou, Lisa Alazraki, Kris Cao, Marek Rei

发表机构 * Imperial College London(帝国理工学院) Cohere

AI总结 针对预训练数据混合中因高质量数据重复率变化导致的小规模实验外推失败问题,提出重复控制子采样方法,在1/16目标token预算下实现接近最优混合,揭示了重复动态而非规模决定实验泛化性。

详情
AI中文摘要

预训练数据混合通常通过运行小规模实验并外推到目标训练预算来调整。当高质量数据稀缺且必须重复时,这种外推经常失败,但失败的原因尚未被隔离。我们表明,一个主要原因是重复不匹配:由于高质量数据集很小,它们的重复率随着训练预算的增长而变化,以小规模代理实验未预期的方式改变最优混合。一种匹配目标重复率的子采样程序可以控制这种效应。在结合有限高质量数据和网络爬取的双源设置中,仅使用目标token的1/16的单一重复控制实验即可恢复757M参数模型的最优混合,误差在0.05以内,而无重复控制时误差为0.75。在没有重复控制的情况下达到相当的精度需要三到四个视野,消耗目标token预算的44%到94%。对于三个数据源,更大的混合空间需要不止一个实验来约束,但该方法仍然有效:在757M规模下,仅两个重复控制视野即可恢复最优混合,优于需要完整双源实验构建的基线。我们的结果表明,重复动态(而非仅规模)决定了小规模混合实验是否泛化。更广泛地说,它们表明数据重复应被视为混合优化中的第一类变量,而不是有限数据的不便副作用。

英文摘要

Pre-training data mixtures are commonly tuned by running small-scale experiments and extrapolating to the target training budget. When high-quality data is scarce and must be repeated, this extrapolation frequently fails, but the source of the failure has not been isolated. We show that a primary culprit is a repetition mismatch: because high-quality datasets are small, their repetition rate changes as the training budget grows, shifting the optimal mixture in ways that small-scale proxy experiments do not anticipate. A subsampling procedure that matches the target repetition rate controls for this effect. In a two-source setting combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens recovers a mixture within 0.05 of the optimum for a 757M parameter model, compared to an error of 0.75 without repetition control. Achieving comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget. With three data sources, the larger mixture space requires more than a single experiment to constrain, but the approach remains effective: at the 757M scale, just two repetition-controlled horizons recover the optimal mixture, outperforming baselines that instead require the full two-source experiments to construct. Our results reveal that repetition dynamics, not scale alone, shape whether small-scale mixture experiments generalize. More broadly, they suggest that data repetition deserves treatment as a first-class variable in mixture optimization, rather than an inconvenient side effect of limited data.
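The arithmetic behind the mismatch can be sketched as follows, under one plausible reading of "matching the target repetition rate": shrink the high-quality set in proportion to the proxy budget. The token counts, the mixture fraction, and the function names are illustrative assumptions, not values or procedures from the paper.

```python
from fractions import Fraction

def repetitions(hq_tokens, budget_tokens, hq_fraction):
    """Number of passes over the high-quality set at a given training budget."""
    return Fraction(hq_fraction) * budget_tokens / hq_tokens

def matched_subsample(hq_tokens, target_budget, proxy_budget):
    """High-quality subset size that keeps the target run's repetition rate."""
    return hq_tokens * proxy_budget // target_budget

hq_tokens = 2_000_000_000         # assumed size of the scarce high-quality source
target_budget = 16_000_000_000    # assumed target training budget
proxy_budget = target_budget // 16
mix = Fraction(1, 4)              # assumed share of high-quality tokens in the mixture

naive = repetitions(hq_tokens, proxy_budget, mix)
subset = matched_subsample(hq_tokens, target_budget, proxy_budget)
controlled = repetitions(subset, proxy_budget, mix)
target = repetitions(hq_tokens, target_budget, mix)
print(naive, controlled, target)
```

Without subsampling, the proxy run sees each high-quality token far less often than the target run would, so the mixture it prefers is tuned for a different repetition regime.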

2606.07596 2026-06-09 cs.LG 新提交

Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

尾部的捷径:通过微调更新的后验谱压缩进行去偏

Edward Sun, Dmitrii Troitskii

发表机构 * UCLA(加州大学洛杉矶分校) Northeastern University(东北大学)

AI总结 提出对微调权重更新进行SVD截断尾部,无需重训练或组标签即可减少虚假关联,在多个模型和基准上以<2%的准确率损失将差距降低最多5倍。

详情
Comments
ICML Weight Space Symmetries Workshop 2026
AI中文摘要

微调常常在引入任务知识的同时引入虚假关联,导致在代表性不足的群体上出现系统性失败。现有的缓解方法需要重训练、组标签或精心设计的反事实数据。我们展示了一种简单的后验干预方法,无需这些条件即可减少捷径依赖:截断 $\Delta W = W_{\mathrm{ft}} - W_{\mathrm{base}}$ 的SVD尾部,可以在保持任务准确率的同时减少虚假组差距。在三个指令微调模型(0.5B-7B)和四个分类基准上,top-$k$ 截断在每项任务上以<2个百分点的准确率损失减少了差距,在CivilComments上最多减少了5倍。我们提出这是因为捷径响应位于 $\Delta W$ 奇异排序的尾部,这是一个关于截断行为而非原始奇异值的论断,原始奇异值分布广泛且在所有四个数据集上看起来相同。一个受控的边界情况(微调只学习一个捷径)显示了预测的FT到基线的崩溃,而bottom-/random-$k$ 和匹配秩的LoRA控制排除了通用低秩近似和秩约束训练作为解释。我们将此视为初步证据,表明 $\Delta W$ 的奇异基是研究微调所学内容的有用坐标系。

英文摘要

Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $\Delta W = W_{\mathrm{ft}} - W_{\mathrm{base}}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models (0.5B to 7B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at less than 2 pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $\Delta W$, a claim about how truncation behaves rather than about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $\Delta W$ is a useful coordinate system for studying what fine-tuning has learned.
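The intervention itself is short enough to sketch for a single weight matrix. This is a minimal NumPy illustration assuming the update is truncated per matrix; the random matrices stand in for real checkpoints, the function name `truncate_update` is hypothetical, and how the paper selects $k$ or which layers it edits is not reproduced.

```python
import numpy as np

def truncate_update(w_base, w_ft, k):
    """Keep only the top-k singular directions of the fine-tuning update."""
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Rebuild the update from the k largest singular values; the tail is dropped.
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return w_base + delta_k

rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 6))                 # stand-in for a base weight
w_ft = w_base + 0.1 * rng.standard_normal((8, 6))    # stand-in for a fine-tuned weight
w_compressed = truncate_update(w_base, w_ft, k=2)
print(np.linalg.norm(w_ft - w_compressed))
```

Because only the difference between the two checkpoints is decomposed, the base weights pass through unchanged and no retraining or group labels are involved.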

2606.07593 2026-06-09 cs.CV cs.AI 新提交

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

视觉Transformer对抗微调的机制分析

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 通过机制分析研究对抗微调对视觉Transformer在扰动和常规图像上性能的影响,发现微调仅改善特定类型扰动,未改变稀疏表示。

详情
AI中文摘要

图像分类模型在高风险现实场景中的广泛应用要求模型对输入图像的轻微扰动(如模糊或锐化)具有鲁棒性。尽管视觉Transformer(ViT)在现代多模态模型(如视觉-语言模型(VLM)和视觉-语言-动作(VLA)模型)中扮演着不可或缺的角色,但在鲁棒性设置中它们缺乏关注。在这项工作中,我们通过机制视角分析了对抗微调(一种提高模型对图像扰动鲁棒性的流行方法)对ViT在扰动和常规图像上性能的影响。我们在低频和高频图像损坏上对抗训练ViT,并试图通过检查模型的注意力机制、内部表示和知识演化来解释下游模型性能的变化。总体而言,我们的结果表明,虽然对带有常见损坏的输入进行微调提高了模型在新损坏数据实例上的性能和确定性,但这些改进不会转移到训练中未见过的其他类别损坏。此外,尽管观察到各层视觉注意力和知识演化的变化,我们发现对抗训练并未导致ViT学习的稀疏表示发生根本性变化。

英文摘要

The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight disturbances or perturbations, such as blurring or sharpening, in the input images. While vision transformers (ViTs) play an integral role in many modern multi-modal models like Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models, they have received little attention in robustness research. In this work, we analyze the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images through a mechanistic lens. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen during training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.

2606.07592 2026-06-09 cs.LG New submission

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

Aditya Upadhyay

Affiliations: IIIT Delhi

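AI summary: Proposes UNIQ, which uses conformal prediction to calibrate uncertainty and apply a state-adaptive conservatism penalty, improving performance on D4RL benchmarks at near-IQL memory cost.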

Comments
19 pages, 2 figures, ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

Abstract

Offline reinforcement learning requires careful conservatism to mitigate distribution shift, yet most existing methods apply a fixed penalty uniformly across all states regardless of local data coverage. We present UNIQ (Uncertainty-Informed Quantile), an offline RL method that introduces state-adaptive conservatism through conformally calibrated uncertainty estimation. Built on the Implicit Q-Learning (IQL) backbone, UNIQ trains a multi-expectile value ensemble, computes distribution-free uncertainty estimates using split conformal prediction, and maps the resulting signal to a state-dependent expectile that relaxes conservatism in well-covered regions while strengthening it in uncertain regions near the data frontier. On D4RL MuJoCo benchmarks, UNIQ consistently improves over IQL, with the largest gains observed on Walker2d and replay-heavy tasks. At the same time, UNIQ operates at near-IQL memory cost (approximately 250 MB peak VRAM), providing roughly a 10x reduction compared to EDAC. Rather than pursuing overall state-of-the-art performance, we position UNIQ as a practical mechanism contribution that improves the performance-efficiency trade-off in offline reinforcement learning.
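The three ingredients the abstract names (expectile regression, split conformal calibration, and a state-dependent expectile) can be sketched roughly as follows. This is an illustrative sketch, not the author's implementation. The function names, the linear mapping in `adaptive_expectile`, and the `tau_low` and `tau_high` defaults are assumptions. So is reading a lower expectile as the more conservative setting, which follows the usual interpretation of IQL. How UNIQ derives its uncertainty signal from the ensemble is not specified in the abstract, so the sketch takes it as a given number.

```python
import math


def expectile_loss(diff, tau):
    """Asymmetric squared loss used in IQL: |tau - 1[diff < 0]| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def conformal_quantile(scores, alpha):
    """Split conformal threshold from held-out calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def adaptive_expectile(uncertainty, threshold, tau_low=0.5, tau_high=0.9):
    """Hypothetical map from calibrated uncertainty to a per-state expectile.

    Low uncertainty gives tau_high (less conservative); uncertainty at or
    above the conformal threshold gives tau_low (more conservative).
    """
    if math.isinf(threshold) or threshold <= 0:
        return tau_low  # no usable calibration: stay conservative
    ratio = min(max(uncertainty / threshold, 0.0), 1.0)
    return tau_high - (tau_high - tau_low) * ratio


# Made-up calibration scores, e.g. absolute value-prediction residuals.
calibration_scores = [0.3, 0.1, 0.7, 0.2, 0.5, 0.4, 0.6]
threshold = conformal_quantile(calibration_scores, alpha=0.25)
tau_covered = adaptive_expectile(0.0, threshold)   # well-covered state
tau_frontier = adaptive_expectile(1.0, threshold)  # state near the data frontier
```

In this sketch the per-state value of `tau` would replace the single fixed expectile that IQL uses in its value loss, so the degree of conservatism varies with local data coverage instead of being set once for the whole dataset.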