arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.11166 2026-06-10 stat.OT cs.AI New submission

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

AI summary: Using a new benchmark in which a data analysis task is completed by writing code, the study finds that a frontier LLM falls short of human experts in average performance, variance, and error magnitude, challenging claims that LLMs perform at human-expert level.

Abstract

Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.
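The distinction the abstract draws between average performance, variance, and error magnitude can be illustrated with a minimal sketch. The function name `summarize` and both error lists are hypothetical and are not data or code from the paper.

```python
from statistics import mean, pvariance


def summarize(errors):
    """Return mean absolute error, variance, and worst-case error magnitude."""
    abs_err = [abs(e) for e in errors]
    return {
        "mean_abs_error": mean(abs_err),
        "variance": pvariance(errors),
        "max_abs_error": max(abs_err),
    }


# Hypothetical signed errors (estimate minus truth), chosen for illustration only.
steady = [0.5, -0.5, 0.5, -0.5]   # small, consistent misses
erratic = [0.0, 0.0, 0.0, 2.0]    # usually exact, occasionally far off

steady_stats = summarize(steady)
erratic_stats = summarize(erratic)
```

Both lists have the same mean absolute error, so an average-only benchmark would not separate them; the variance and the maximum error do.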

2606.11156 2026-06-10 stat.ML cs.LG New submission

Itô maps for any-step SDEs

Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, Jakiw Pidstrigach

AI summary: Proposes the Itô map, an any-step stochastic flow map that predicts future states in a single forward pass, enabling exact distillation of stochastic dynamics and supporting inference-time control and posterior sampling.

Abstract

Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.
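For orientation, the sketch below shows a conventional multi-step integrator (Euler-Maruyama) written as a function of a start state and an explicit Brownian path. It is not the Itô map itself, only the kind of (state, path) to future-state mapping that a single-pass model could be trained to reproduce. The drift, noise scale, and step size are hypothetical choices.

```python
import math
import random


def brownian_increments(n_steps, dt, rng):
    """Sample independent Gaussian increments with variance dt."""
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def integrate(x0, increments, dt, drift, sigma):
    """Euler-Maruyama: deterministic once the start state and path are fixed."""
    x = x0
    for dw in increments:
        x = x + drift(x) * dt + sigma * dw
    return x


def drift(x):
    """Mean-reverting drift, a hypothetical example."""
    return -x


rng = random.Random(0)
dt = 0.01
path = brownian_increments(100, dt, rng)
end_a = integrate(1.0, path, dt, drift, sigma=0.5)
end_b = integrate(1.0, path, dt, drift, sigma=0.5)
```

Reusing the same path reproduces the same endpoint, while drawing a fresh path from the same start state gives a different one, which is the source of the endpoint diversity mentioned in the abstract.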

2606.11125 2026-06-10 eess.SP cs.LG New submission

DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals

Yidan Shen, Neville Mathew, Maham Rahimi, Deependra Dhakal, George Zouridakis, Xin Fu, Renjie Hu

AI summary: Proposes a Transformer-based network for cuffless blood pressure estimation from PPG signals that incorporates demographic information through FiLM-style feature modulation and adds an auxiliary morphology head to steer the model toward waveform morphology related to arterial stiffness, achieving an MAE of 4.56 mmHg for systolic and 2.62 mmHg for diastolic blood pressure on the PulseDB dataset.

Abstract

Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signal, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.
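FiLM-style modulation scales and shifts each feature channel with parameters predicted from a conditioning vector. The sketch below shows that single operation on one feature vector; where exactly the paper applies it inside the attention and feed-forward sublayers is not reproduced here, and all weights and inputs are hypothetical.

```python
def linear(weights, bias, x):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(cond) * features + beta(cond), channel by channel."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Hypothetical two-channel features and a two-value demographic vector.
features = [1.0, -2.0]
cond = [0.5, 1.0]
w_gamma = [[0.0, 0.0], [1.0, 0.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 1.0], [0.0, 0.0]]
b_beta = [0.0, 0.0]

out = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
```

Unlike late fusion, the conditioning here changes the intermediate features themselves rather than being concatenated just before the output layer.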

2606.11044 2026-06-10 stat.ML cs.LG New submission

Generalized Conformal Predictive Systems Under Distributional Shifts

Jef Jonkers, Johanna Ziegel

AI summary: To handle distributional shift, extends generalized conformal predictive systems by encoding the shift through observation-specific permutation weights, yielding shift-aware predictive systems, and introduces weight-uncertainty boxes to construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees.

Comments
27 pages, 10 figures

Abstract

Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.
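A rough sketch of how a weighted draw from observed atoms turns into a band of CDFs: calibration values carry weights, and the test point's own weight is the mass that may fall on either side of a query value. This construction is an assumption made for illustration, not the paper's estimator; the function name and numbers are hypothetical.

```python
def weighted_cdf_band(values, weights, test_weight, y):
    """Lower and upper predictive CDF at y from weighted calibration values."""
    total = sum(weights) + test_weight
    below = sum(w for v, w in zip(values, weights) if v <= y)
    # The test point's mass is the gap between the two bounds.
    return below / total, (below + test_weight) / total


# Hypothetical calibration values and shift weights.
values = [1.0, 2.0, 3.0]
weights = [1.0, 1.0, 2.0]
lo, hi = weighted_cdf_band(values, weights, test_weight=1.0, y=2.0)
```

With all weights equal, the bounds reduce to i/(n+1) and (i+1)/(n+1), the familiar exchangeable case.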

2606.10972 2026-06-10 eess.AS cs.AI New submission

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

AI summary: Optimizes 2D input representations (MFCC, log-mel spectrogram, VAR model) and sub-phase feature fusion strategies (direct concatenation, GRU, GRU with attention) for differentiating asthma from COPD with CNN- and GRU-based networks, reaching a best F1-score of 0.877.

Abstract

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.
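Mixup, the augmentation the abstract reports as the best of those tested, forms a convex combination of two training examples and their labels. The sketch below shows the operation on toy vectors; the Beta parameter 0.4 and the inputs are hypothetical and are not taken from the paper.

```python
import random


def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# In training, lam is usually drawn from a Beta distribution.
rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)

# Fixed weight here so the result is easy to follow.
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 4.0], [0.0, 1.0], lam=0.25)
```

The blended label is soft (a quarter asthma, three quarters COPD in this toy case), which is one reason synthetic mixtures can differ from authentic recordings.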

2606.10906 2026-06-10 stat.ML cs.AI cs.LG New submission

Human-AI Teaming Through the Lens of Calibration

Eric Nalisnick, Chi Zhang, Sophia Qian, Yixin Wang

AI summary: Analyzes human-AI teaming models through the lens of statistical calibration, finding that combination methods do not preserve the human's degree of calibration, while delegation methods shift the calibration burden onto the rejector meta-model, a burden that becomes unattainable when the human relies on information the system cannot observe.

Comments
19 pages, 5 figures (including appendix)

Abstract

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empirical results that existing methods for combination do not preserve the human's degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human's expertise and becomes unattainable when the human relies on information the system cannot observe.
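Calibration with respect to a partition means that, within each cell, the average predicted probability matches the observed frequency. A minimal check under that reading is sketched below; the cell labels, probabilities, and outcomes are hypothetical.

```python
from collections import defaultdict


def calibration_gaps(cells, probs, labels):
    """Per-cell |mean predicted probability - observed frequency|."""
    grouped = defaultdict(list)
    for c, p, y in zip(cells, probs, labels):
        grouped[c].append((p, y))
    gaps = {}
    for c, pairs in grouped.items():
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        gaps[c] = abs(mean_p - freq)
    return gaps


# Hypothetical predictor: calibrated on cell "a", overconfident on cell "b".
cells = ["a", "a", "b", "b"]
probs = [0.5, 0.5, 0.9, 0.9]
labels = [1, 0, 1, 0]
gaps = calibration_gaps(cells, probs, labels)
```

A finer partition exposes gaps that a coarse one averages away, which is why the demand on the rejector grows with how finely it must locate each member's strengths.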

2606.10889 2026-06-10 q-bio.NC cs.LG New submission

Sleep EEG Signal Criticality as a Non-Invasive Predictor of Cognitive Decline in Dementia

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

AI summary: Quantifies sleep EEG signal criticality with multifractal detrended fluctuation analysis and finds that cognitively healthy individuals are closer to an optimally critical state, while the dementia group's DFA exponents shift toward 1.0, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms and could serve as an early screening tool.

Comments
4 pages, 2 figures, accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026

Abstract

Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment (3MS $< 78$). Our results reveal significant group-level differences in Hurst exponent $H(q)$ distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ($p \leqslant 0.001$), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep architecture. The dementia group demonstrated a shift in DFA exponents toward $1.0$, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.
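The DFA exponent mentioned above is the slope of log fluctuation against log window size after linear detrending. The sketch below implements plain DFA, which corresponds to the q = 2 case of MFDFA, on synthetic white noise. The scales, signal length, and seed are hypothetical choices, and this is not the authors' pipeline.

```python
import math
import random


def linfit(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_exponent(signal, scales):
    """Slope of log F(s) versus log s, with a linear trend removed per window."""
    mean = sum(signal) / len(signal)
    profile, acc = [], 0.0
    for v in signal:
        acc += v - mean          # cumulative sum of the centred signal
        profile.append(acc)
    log_s, log_f = [], []
    for s in scales:
        t = list(range(s))
        sq = []
        for start in range(0, len(profile) - s + 1, s):
            seg = profile[start:start + s]
            a, b = linfit(t, seg)
            sq.extend((y - (a * ti + b)) ** 2 for ti, y in zip(t, seg))
        log_s.append(math.log(s))
        log_f.append(0.5 * math.log(sum(sq) / len(sq)))  # log of RMS fluctuation
    return linfit(log_s, log_f)[0]


random.seed(0)
white = [random.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(white, [16, 32, 64, 128, 256])
```

Uncorrelated noise should give an exponent near 0.5, whereas 1/f-like signals give values near 1.0, the value the abstract reports the dementia group shifting toward.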

2606.10781 2026-06-10 eess.AS cs.CL New submission

Recovering the Zipfian Distribution in Unsupervised Term Discovery

Danel Slabbert, Simon Malan, Herman Kamper

AI summary: To address the overly uniform distributions produced by centre-based clustering in unsupervised term discovery, proposes graph clustering, which substantially outperforms K-means and other centre-based methods on three languages and recovers lexicon distributions closer to Zipfian.

Abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
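One simple way to quantify how Zipf-like a discovered lexicon is: sort cluster sizes, then fit the slope of log frequency against log rank. The sketch below does that for two hypothetical sets of cluster assignments; it is an illustrative diagnostic and not necessarily the metric used in the paper.

```python
import math
from collections import Counter


def zipf_slope(labels):
    """Slope of log(frequency) against log(rank) over cluster sizes."""
    counts = sorted(Counter(labels).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical cluster assignments: sizes proportional to 1/rank, and equal sizes.
zipf_like = [k for k in range(1, 21) for _ in range(120 // k)]
uniform = [k for k in range(20) for _ in range(6)]

slope_zipf = zipf_slope(zipf_like)
slope_uniform = zipf_slope(uniform)
```

A slope near -1 indicates a Zipfian lexicon, while a slope near 0 indicates the flat distribution the abstract attributes to K-means.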

2606.10770 2026-06-10 stat.ME cs.LG New submission

Correcting Variable Importance Scored by Random Forests

Guancheng Zhou, Haiping Xu, Jason Liu, Donghui Yan

AI summary: To address the effect of correlations among variables on Random Forest variable importance, proposes correcting it by grouping variables according to their conditional correlations; experiments show that both computationally efficient schemes yield sensible corrections.

Comments
22 pages, 10 figures

Abstract

Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables: variables that are correlated with many other variables tend to receive a lower importance index, or are completely masked (i.e., receive an importance index near zero) by other strongly correlated variables. To prevent unwanted influence from correlated variables when calculating variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one groups variables individually and then separates the variable of interest from all correlated variables, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.
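To make "correlation conditional on the response" concrete, the sketch below uses the textbook partial correlation of two predictors given the response. Whether the paper defines conditional correlation this way is an assumption; the data are hypothetical and built so that the two predictors are related only through the response.

```python
import math


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def partial_corr(a, b, y):
    """Correlation between a and b after removing their linear dependence on y."""
    r_ab, r_ay, r_by = corr(a, b), corr(a, y), corr(b, y)
    return (r_ab - r_ay * r_by) / math.sqrt((1.0 - r_ay ** 2) * (1.0 - r_by ** 2))


# Hypothetical data: a and b each equal y plus independent noise.
y = [-1, -1, -1, -1, 1, 1, 1, 1]
a = [-2, -2, 0, 0, 0, 0, 2, 2]
b = [-2, 0, -2, 0, 0, 2, 0, 2]

marginal = corr(a, b)
conditional = partial_corr(a, b, y)
```

The predictors look correlated marginally, but the association vanishes once the response is conditioned on, so a grouping based on the conditional quantity would keep them apart.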

2606.10738 2026-06-10 eess.AS cs.AI New submission

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

AI summary: Proposes Spatial-Omni, which uses SO-Encoder to inject First-Order Ambisonics spatial audio into existing Omni LLMs, adding spatial audio understanding in a lightweight way and outperforming existing models on the newly constructed SO-Bench benchmark.

Abstract

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.
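The sketch below shows what the four FOA channels carry that a monaural signal does not. It encodes one mono sample arriving from a given direction and recovers the azimuth from the channels. The ACN channel order with SN3D normalisation is assumed here; the abstract does not state which convention the dataset uses, and the function names are hypothetical.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample as (W, Y, Z, X): ACN order, SN3D normalisation."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                  # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)    # left-right
    z = sample * math.sin(el)                   # up-down
    x = sample * math.cos(az) * math.cos(el)    # front-back
    return w, y, z, x


def estimate_azimuth_deg(w, y, z, x):
    """Recover azimuth from the horizontal intensity direction."""
    return math.degrees(math.atan2(w * y, w * x))


w, y, z, x = encode_foa(1.0, azimuth_deg=90.0, elevation_deg=0.0)
azimuth = estimate_azimuth_deg(w, y, z, x)
```

Keeping only W, which is effectively what mono processing does, makes the direction unrecoverable.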

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG New submission

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

AI summary: Proposes ++nnU-Net, which performs data augmentation through image registration, generating warped images before preprocessing and training, and improves the Dice coefficient by up to about 22% across five 2D datasets.

Comments
7 pages, 1 figure, 2 tables

Abstract

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
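The Dice Similarity Coefficient used to report the gains is twice the overlap of two masks divided by their total size. A minimal sketch on nested-list binary masks follows; the masks are hypothetical.

```python
def dice(pred, target):
    """Dice similarity coefficient for two binary masks given as nested lists."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(a and b for a, b in zip(p, t))   # pixels set in both masks
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total


# Hypothetical 2x2 prediction and ground-truth masks.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice(pred, target)
```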

2606.10673 2026-06-10 stat.OT cs.LG New submission

ClusBench: The Clustering Benchmark Data Resource You've All Been Waiting For (?)

David P. Hofmeyr

AI summary: Fits flexible non-parametric distributions to more than 200 public datasets to generate close to 3000 synthetic datasets for large-scale evaluation of clustering methods while retaining the nuance of real-world data.

Abstract

Although some very common test beds exist for assessing the performance of clustering methods, large scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets; the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.

2606.10631 2026-06-10 econ.GN cs.CR q-fin.EC New submission

From Transactions to Records: Reconceptualizing Blockchain Systems through a Lifecycle Lens

Tom Barbereau, Ruggero Montalto, Christian Beyer

AI summary: Draws on the records management principles of ISO 15489-1:2016 to propose a seven-stage lifecycle model for blockchain data, applies it to Bitcoin, a fungible token, and a non-fungible token, and argues that blockchain systems are not merely transactional infrastructures but records management systems with distinctive characteristics.

Abstract

Current blockchain research and analytics tend to prioritize observable on-chain transactions, obscuring the processes through which cryptocurrencies are created, publicised, retained, and disposed of. In response, this paper considers distributed ledger technologies from records management principles in ISO 15489-1:2016. Setting off by specifying the parallels -- that is transactions as "records", crypto-asset units as "information assets", and blockchains as "aggregations" -- we introduce a seven-stage lifecycle for blockchain data. We apply the framework to Bitcoin, a fungible token, and a non-fungible token. On this basis, we argue that blockchain systems are not merely transactional infrastructures but record management systems with distinctive characteristics. We discuss how the on-chain/off-chain boundary and privacy-enhancing technologies can complicate lifecycle visibility, with particular relevance for crypto-crime research and investigation. As a meta-level framework, the lifecycle perspective enables positioning existing research, decomposing legal, regulatory, technological, and operational challenges by stage, and informing lifecycle-aware approaches to blockchain governance, analytics, and regulation.

2606.10454 2026-06-10 eess.AS cs.SD New submission

Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM Framework: A Case Study of Multi-Domain Child-Adult ASR

Mohan Shi, Kaiyuan Zhang, Zilai Wang, Natarajan Balaji Shankar, Eray Eren, Abeer Alwan

AI summary: Proposes a Mixture-of-Experts Speech-LLM that combines classifier-based domain routing, Mixture-of-Projectors and Mixture-of-LoRAs modules, and an entropy-aware routing mechanism to achieve unified child-adult ASR across diverse environments and age groups, with consistent improvements on public child corpora.

Comments
Accepted to Interspeech 2026

Abstract

While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
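The idea of bringing in a shared expert when the router is unsure can be sketched as follows. The router's softmax entropy, normalised to [0, 1], is compared against a threshold. The threshold value, the hard on/off rule, and the function names are hypothetical, and the paper's EAR mechanism may blend experts differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(logits, threshold=0.5):
    """Pick the top domain expert; add the shared expert when entropy is high."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    normalised = entropy / math.log(len(probs))   # 0 = certain, 1 = uniform
    top = max(range(len(probs)), key=probs.__getitem__)
    return {"expert": top, "use_shared": normalised > threshold, "entropy": normalised}


# Hypothetical router logits over three domains.
confident = route([8.0, 0.0, 0.0])
unsure = route([1.0, 0.9, 0.0])
```

Inputs near a domain boundary, such as speech from an older child, produce flatter router outputs and therefore trigger the shared expert.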

2606.10361 2026-06-10 stat.ML cs.LG New submission

Near-Exponential Convergence Rates for kNN Classification based on Boltzmann Margin

Luyuan Yang, Shayan Shafaei, Chao Lan

AI summary: Proposes the Boltzmann margin condition, which lies between the Tsybakov and Massart margins, and establishes the first near-exponential convergence rates for kNN classifiers.

Comments
Conference on Uncertainty in Artificial Intelligence (UAI)

Abstract

Convergence-rate analysis for classifiers is often conducted under either Tsybakov margin or Massart margin. The former is a relatively weak condition that typically yields polynomial rates, while the latter is substantially stronger but can guarantee exponential rates. In this paper, we introduce a new condition, called Boltzmann margin, that bridges the gap between these two regimes. It is weaker than Massart margin, generally stronger than Tsybakov margin, and can imply many of their properties under suitable conditions. We apply Boltzmann margin to the analysis of kNN classifiers and establish the first near-exponential convergence rates for kNN classification. We also present extensions of the main results and provide numerical evidence supporting the main theoretical implications.
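For reference, the two established margin conditions are usually stated as follows for binary classification. These are the common textbook forms, and the paper may use a different parametrisation. The Boltzmann margin itself is not defined in the abstract and is not reproduced here.

```latex
% \eta(x) = \Pr(Y = 1 \mid X = x) denotes the regression function.
% Tsybakov margin with exponent \alpha \ge 0 and constant C > 0:
\[
  \Pr\bigl( \lvert \eta(X) - \tfrac{1}{2} \rvert \le t \bigr) \le C\, t^{\alpha}
  \quad \text{for all sufficiently small } t > 0 .
\]
% Massart margin with gap h \in (0, \tfrac{1}{2}]:
\[
  \lvert \eta(X) - \tfrac{1}{2} \rvert \ge h \quad \text{almost surely}.
\]
```

Tsybakov's condition only limits how much probability mass sits near the decision boundary, whereas Massart's excludes a whole band around it, which is why the latter is the stronger assumption.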

2606.10317 2026-06-10 eess.AS cs.SD New submission

SSL-GMMVC: Interpretable Voice Conversion via Locally Linear GMM Transforms in Self-Supervised Representation Space

Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito, Nobuaki Minematsu

AI summary: Proposes SSL-GMMVC, which models source-target features with a Gaussian mixture model in self-supervised speech space and performs interpretable voice conversion through posterior-weighted affine transforms, improving speaker similarity while maintaining intelligibility and naturalness.

Comments
Accepted to Interspeech 2026

Abstract

We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
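Conversion as a posterior-weighted sum of affine transforms can be sketched for scalar features. Each mixture component contributes its conditional-mean map, weighted by how responsible it is for the input. The formula is the classical joint-GMM conditional expectation; the component parameters are hypothetical, and the real method operates on high-dimensional self-supervised features.

```python
import math


def gaussian(x, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, components):
    """Posterior-weighted sum of per-component affine maps (scalar case)."""
    likes = [c["weight"] * gaussian(x, c["mu_x"], c["var_x"]) for c in components]
    total = sum(likes)
    y = 0.0
    for like, c in zip(likes, components):
        slope = c["cov_yx"] / c["var_x"]           # local linear gain
        y += (like / total) * (c["mu_y"] + slope * (x - c["mu_x"]))
    return y


# Hypothetical two-component joint model of source (x) and target (y) features.
components = [
    {"weight": 0.5, "mu_x": -5.0, "var_x": 1.0, "mu_y": 0.0, "cov_yx": 2.0},
    {"weight": 0.5, "mu_x": 5.0, "var_x": 1.0, "mu_y": 10.0, "cov_yx": 0.5},
]

low = convert(-5.0, components)
high = convert(5.0, components)
```

Because each component's slope and offset can be read off directly, the learned transform stays inspectable in a way a neural converter is not.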

2606.10301 2026-06-10 eess.SP cs.SY eess.SY New submission

Fundamentals of NOMA in Low-Earth Orbit Coordinated Multi-Satellite Networks

Xiangyu Li, Bodong Shang, Junchao Ma, Qingqing Wu, Jie Feng, Deshuang Huang

AI summary: Studies the downlink performance of low-Earth-orbit coordinated multi-satellite networks combined with non-orthogonal multiple access, using stochastic geometry to analyze coverage and spectral efficiency; finds that adding cooperative satellites does not necessarily improve performance and that well-chosen power allocation yields substantial gains.

Abstract

Coordinated multi-satellite (CoMS) transmission and non-orthogonal multiple access (NOMA) are envisioned to jointly enhance coverage, capacity, and spectrum efficiency for satellite networks. Their integration into a unified CoMS-NOMA framework will allow more efficient, reliable, and energy-efficient multi-user access. This paper investigates the downlink performance of CoMS-NOMA networks from a system-level perspective, in which multiple satellites cooperatively serve multiple users via NOMA. Leveraging tools from stochastic geometry, related angles and distances in CoMS-NOMA are first derived as intermediate results. Then, we obtain the combined signal power distributions and analyze coverage and spectrum performance under both inter- and intra-satellite interference, accounting for potential imperfect successive interference cancellation (SIC). The analytical model is validated across a range of system parameters, including the number of satellites, service region angle, error-propagation factor, and power allocation coefficients. Numerical results indicate that increasing the number of cooperative satellites does not always improve coverage and spectrum efficiency. Additionally, while a higher main-lobe gain improves coverage, a near-perfect SIC provides only slightly greater benefits than a reasonably good SIC. With properly selected power allocation coefficients, CoMS-NOMA achieves up to a 270% improvement in coverage and a 56% gain in sum spectral efficiency, compared with conventional orthogonal and single-satellite schemes, indicating potential for green, energy-efficient satellite networking.
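The roles of the power allocation coefficients and imperfect SIC can be seen in the standard two-user downlink NOMA rate expressions, sketched below. This deliberately omits multi-satellite cooperation, inter-satellite interference, and the stochastic-geometry averaging that the paper analyzes; the channel gains, SNR, and residual factor are hypothetical.

```python
import math


def noma_rates(g_near, g_far, a_far, snr, sic_residual=0.0):
    """Achievable rates (bit/s/Hz) for the near and far user in two-user NOMA."""
    a_near = 1.0 - a_far
    # Far user decodes its own signal, treating the near user's as interference.
    sinr_far = a_far * snr * g_far / (a_near * snr * g_far + 1.0)
    # Near user cancels the far user's signal; a fraction may remain.
    sinr_near = a_near * snr * g_near / (sic_residual * a_far * snr * g_near + 1.0)
    return math.log2(1.0 + sinr_near), math.log2(1.0 + sinr_far)


# Hypothetical link budget: strong near user, weak far user.
near_ok, far_ok = noma_rates(1.0, 0.1, a_far=0.8, snr=100.0)
near_bad, far_bad = noma_rates(1.0, 0.1, a_far=0.8, snr=100.0, sic_residual=0.1)
```

Only the near user's rate depends on SIC quality, so the benefit of pushing the residual toward zero is bounded by how much of the sum rate that user contributes.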

2606.10280 2026-06-10 eess.IV cs.CV 新提交

Overlapped Wavelet Diffusion for Low-Light Image Enhancement

重叠小波扩散用于低光照图像增强

Fen Peng, Taizo Suzuki, Seisuke Kyochi

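AI summary: Proposes OWDiff, an overlapped wavelet diffusion framework that removes blocking artifacts via an overlapped wavelet transform and adds a low-frequency-guided high-frequency enhancement module to recover detail, outperforming existing methods on the LOLv1 and LOLv2-real datasets.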

详情
Journal ref
IEICE Transactions on Information and Systems, Advance online publication, 2026
Comments
Advance published in IEICE Transactions on Information and Systems. DOI: 10.1587/transinf.2026PCP0006. Code: https://github.com/FinnPeg/Overlapped-Wavelet-Diffusion
AI中文摘要

在这项研究中,我们提出了一种用于低光照图像增强(LLIE)的重叠小波扩散框架,该框架包含两个互补组件,以实现无块伪影和细节保持的增强。尽管与传统方法相比,最近基于扩散的LLIE方法表现出显著性能,但DiffLL仍然遭受由Haar小波变换(WT)引起的块伪影以及由于其高频恢复模块(HFRM)的限制导致的边缘模糊或纹理过度平滑。为了克服这些问题,我们引入了重叠小波变换(OWT),它融合了相邻区域的相关性,从而在结构上防止块伪影。此外,我们集成了一个低频引导的高频增强模块(HFEBlock)来加强细节恢复,产生更清晰的边缘和更可靠的纹理。在LOLv1和LOLv2-real数据集上的大量实验表明,我们的框架(称为OWDiff)在定性和定量上均持续优于现有的LLIE方法,在保持计算效率的同时实现了卓越的视觉质量。OWDiff有效解决了Haar WT和HFRM的结构限制,与DiffLL相比,在LOLv1和LOLv2-real数据集上平均PSNR增益为0.58 dB,SSIM相对提高1.64%,LPIPS相对降低5.9%。

英文摘要

In this study, we propose an overlapped wavelet diffusion framework for Low-Light Image Enhancement (LLIE), which incorporates two complementary components to achieve blocking artifact-free and detail-preserving enhancement. Although recent diffusion-based LLIE methods have demonstrated remarkable performance compared with traditional approaches, DiffLL still suffers from blocking artifacts caused by the Haar Wavelet Transform (WT) and blurred edges or over-smoothed textures due to the limitations of its High-Frequency Restoration Module (HFRM). To overcome these issues, we introduce an Overlapped WT (OWT) that incorporates correlations across neighboring regions, thereby structurally preventing blocking artifacts. Furthermore, we integrate a low-frequency-guided High-Frequency Enhance Block (HFEBlock) to strengthen detail recovery, yielding sharper edges and more reliable textures. Extensive experiments on the LOLv1 and LOLv2-real datasets demonstrate that our framework, termed OWDiff, consistently outperforms existing LLIE methods both qualitatively and quantitatively, achieving superior visual quality while maintaining computational efficiency. OWDiff effectively addresses the structural limitations of the Haar WT and the HFRM, achieving an average PSNR gain of 0.58 dB, along with a 1.64% relative improvement in SSIM and a 5.9% relative reduction in LPIPS, compared to DiffLL across both the LOLv1 and LOLv2-real datasets.
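
Why a plain Haar transform is prone to blocking artifacts can be seen from a single 2x2 block. The sketch below is only an illustration of the standard orthonormal one-level 2D Haar transform, not of the paper's OWT. The function names are hypothetical. Each coefficient depends on exactly one non-overlapping block, so an error in a coefficient stays confined to that block and shows up as a step at the block boundary.

```python
def haar_block(a, b, c, d):
    """Orthonormal one-level 2D Haar transform of the block [[a, b], [c, d]]."""
    ll = (a + b + c + d) / 2.0   # block average (scaled)
    lh = (a - b + c - d) / 2.0   # left column minus right column
    hl = (a + b - c - d) / 2.0   # top row minus bottom row
    hh = (a - b - c + d) / 2.0   # diagonal difference
    return ll, lh, hl, hh

def inverse_haar_block(ll, lh, hl, hh):
    """Invert haar_block and return the four pixels (a, b, c, d)."""
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    return a, b, c, d

# Two neighboring flat blocks with the same intensity.
left = haar_block(10.0, 10.0, 10.0, 10.0)
right = haar_block(10.0, 10.0, 10.0, 10.0)

# A restoration error on the low-frequency coefficient of the left block only.
left_perturbed = (left[0] + 2.0,) + left[1:]

print(inverse_haar_block(*left_perturbed))  # every pixel of the left block shifts
print(inverse_haar_block(*right))           # the right block is untouched
```

A transform whose basis functions overlap neighboring regions spreads such an error across the boundary instead, which is the structural property the abstract attributes to the OWT.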

2606.10238 2026-06-10 q-bio.NC cs.AI 新提交

Hyperbolic Neural Population Geometry Benefits Computation

双曲神经群体几何结构有益于计算

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

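AI summary: Presents a theoretical framework in which hippocampal population activity induces hyperbolic geometry, shows that the Modern Hopfield Network update rule computes the minimum mean-squared-error estimator, and introduces a new associative memory model in hyperbolic space whose capacity is significantly larger than that of existing models.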

详情
Comments
Accepted at ICML 2026, 37 pages, 5 figures
AI中文摘要

神经群体几何结构影响下游计算。最近神经生物学的实验发现表明,海马体中的群体活动具有双曲结构。本文为这一现象提供了理论框架。首先,我们提出了一种海马体调谐曲线的合理构造,该构造在统计上诱导双曲几何。接着,我们通过证明现代Hopfield网络更新规则计算最小均方误差(MMSE)估计,建立了神经解码与联想记忆之间的联系。最后,我们引入了一个在双曲空间中定义的新型联想记忆模型,其容量显著大于领先模型。我们的结果表明,动物将空间信息编码为潜在的双曲认知地图,从而提高了记忆容量和解码精度。

英文摘要

Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.
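
The link between the Modern Hopfield update and MMSE estimation can be checked numerically in a simple Euclidean setting. The sketch below rests on assumptions that are not taken from the abstract: stored patterns of equal norm, a uniform prior over patterns, and isotropic Gaussian observation noise with standard deviation `sigma`. Under those assumptions the posterior mean of the stored pattern equals the softmax-weighted update with inverse temperature `beta = 1 / sigma**2`. The paper's own conditions and its hyperbolic model may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(patterns, query, beta):
    """One Modern Hopfield step: softmax(beta * <x_i, query>)-weighted patterns."""
    scores = [beta * sum(p * q for p, q in zip(pat, query)) for pat in patterns]
    weights = softmax(scores)
    return [sum(w * pat[k] for w, pat in zip(weights, patterns))
            for k in range(len(query))]

def posterior_mean(patterns, observation, sigma):
    """MMSE estimate of the stored pattern given observation = pattern + noise."""
    log_lik = [-sum((o - p) ** 2 for o, p in zip(observation, pat)) / (2.0 * sigma ** 2)
               for pat in patterns]
    weights = softmax(log_lik)  # uniform prior, so the posterior is softmax(log-likelihood)
    return [sum(w * pat[k] for w, pat in zip(weights, patterns))
            for k in range(len(observation))]

patterns = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # all have unit norm
noisy = [0.8, 0.3]
sigma = 0.5
print(hopfield_update(patterns, noisy, beta=1.0 / sigma ** 2))
print(posterior_mean(patterns, noisy, sigma))
```

The two printed vectors agree because the terms involving the squared norm of the observation and of the patterns are constant across patterns and cancel inside the softmax.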

2606.10233 2026-06-10 eess.AS cs.LG cs.SD 新提交

ANCHOR: Autoregressive Non-intrusive Chunk-Ordered Refinement for Joint Multi-Resolution Speech Quality Modeling

ANCHOR: 自回归非侵入式分块有序细化用于联合多分辨率语音质量建模

Zhuoyan Tao, Jiatong Shi, Hye-jin Shim, Shinji Watanabe

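AI summary: Proposes ANCHOR, which recasts incremental speech quality assessment as a multi-resolution autoregressive task, uses dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement from chunk level to utterance level, substantially reduces error on partial input, and sheds light on how perceptual quality accumulates over time.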

详情
Comments
Accepted at Interspeech 2026
AI中文摘要

虽然语音质量通常是在完整话语上评估的,但流式和生成系统需要从部分音频中进行增量估计。现有的预测器假设完整的上下文,在受前缀约束的输入上性能下降。扩展ARECHO,我们提出ANCHOR,将增量评估重新表述为多分辨率自回归任务。它使用双分辨率令牌和分辨率感知层次结构在单个解码器中建模分块级和话语级质量,实现从粗到细的细化。实验表明,在部分输入下具有显著的鲁棒性,包括在2秒前缀上PLCMOS误差减少48%。收敛性分析揭示了4-6秒的有效感知上下文范围。压力测试进一步隔离了局部损坏下的结构化外推偏差。结果表明,层次监督改进了增量预测,并阐明了感知质量如何随时间累积。

英文摘要

While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.

2606.10231 2026-06-10 eess.AS cs.SD 新提交

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

LLM 能读频谱图:无编码器的语音语言建模

Ruchao Fan, Yiming Wang, Yuxuan Hu, Bo Ren, Yufei Xia, Xiaofei Wang, Yao Qian, Jinyu Li

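AI summary: Proposes Mel-LLM, an architecture with no dedicated speech encoder that feeds Mel spectrogram patches into an LLM through a linear projection. ASR and TTS experiments support its feasibility: ASR performance is comparable to encoder-based approaches, and TTS results are preliminary.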

详情
AI中文摘要

最近的语音感知大语言模型(Speech-LLMs)依赖预训练的语音编码器将音频转换为 LLM 可消费的语义丰富表示。相反,在这项工作中,我们探索:LLM 能否直接学习读取梅尔频谱图,而无需专用的语音编码器?我们提出 Mel-LLM,一种无编码器的 Speech-LLM,它将经过轻量预处理的梅尔频谱图补丁通过线性投影直接输入 LLM,使 LLM 仅通过自身参数学习语音-文本对齐。我们在自动语音识别(ASR)和文本到语音(TTS)任务上进行了大量实验。对于 ASR,我们在 OpenASR 排行榜公开集和生产级扩展实验上评估,表明无编码器方案在性能上具有竞争力,与有编码器初始化的对应方案相比仅有有限退化。我们发现,当数据有限时,从多模态检查点(Phi-4-MM)初始化对于保持性能至关重要。我们还进行了消融研究,揭示了哪些 LLM 层与语音编码相关性较低。对于 TTS,我们展示了使用下一个令牌 VAE 方法的初步结果。虽然 TTS 性能尚未达到最优,但这些结果确立了用于自回归语音-文本建模的完全统一无编码器架构的可行性。

英文摘要

Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by the LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrograms directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.

2606.10187 2026-06-10 stat.ML cs.LG 新提交

Decision-Calibrated Conformal Uncertainty for Pacing Decisions in Streaming Advertising

面向流式广告中节奏控制的决策校准共形不确定性

Prashant Shekhar, Caroline Howard

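AI summary: Proposes a decision-calibrated conformal framework that calibrates uncertainty by the largest impact of forecast error on the policies that could actually be deployed, proves that the score is the smallest valid uncertainty measure protecting all deployable pacing policies, and markedly reduces uncertainty radii on public datasets.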

详情
AI中文摘要

我们开发了一个决策校准的共形框架,用于流式广告中的节奏控制决策。节奏控制依赖于不确定的未来库存、需求压力、增量响应和会员体验负载。该框架不是校准通用的预测残差,而是通过预测误差对实际可能部署的策略的最大影响来衡量预测误差。主要定理表明,所提出的分数是统一保护所有可部署节奏控制策略的最小有效不确定性度量。几何上,它是有符号策略敏感性集的支持函数。分裂共形校准为该分数提供了有限样本覆盖。一个高维分离定理表明,传统的残差校准可能因支付干扰库存维度而任意保守,而一个鲁棒的节奏控制结果结合了库存、响应和体验不确定性。在基于Criteo Uplift和KuaiRand数据集构建的公开数据校准节奏控制回放中,传统共形节奏控制仍然未解决,在Criteo上残差半径高达7236.7,在KuaiRand上为4629.4。采用所提出的决策校准方法,不确定性半径分别降至18.4和278.6,并为价值、交付、预算和会员负载设置了单独的边际。在Criteo上,所提出的方法证明了比点预测基线更不激进的节奏控制策略,并将保留的任何违规率从16.7%降至3.3%,且预算和会员负载违规为零。在KuaiRand上,选择仍未解决。简而言之,本文确立了预测、响应估计和会员体验模型应根据它们是否缩小节奏控制决策使用的不确定性来判断,因为这会导致自信且不过度保守的决策。

英文摘要

We develop a decision-calibrated conformal framework for pacing decisions in streaming advertising. Pacing depends on uncertain future inventory, demand pressure, incremental response, and member-experience load. Instead of calibrating a generic forecast residual, the framework measures forecast error by its largest impact on the policies that could actually be deployed. The main theorem shows that the proposed score is the smallest valid uncertainty measure that uniformly protects all deployable pacing policies. Geometrically, it is the support function of the signed policy sensitivity set. Split conformal calibration gives finite-sample coverage for this score. A high-dimensional separation theorem shows that traditional residual calibration can be arbitrarily more conservative by paying for nuisance inventory dimensions, and a robust pacing result combines inventory, response, and experience uncertainty. On public-data-calibrated pacing replays built from Criteo Uplift and KuaiRand datasets, traditional conformal pacing remains unresolved with high residual radii of 7236.7 on Criteo and 4629.4 on KuaiRand. With the proposed decision calibration approach, the uncertainty radii are reduced to 18.4 and 278.6 respectively, with separate margins for value, delivery, budget, and member load. On Criteo, the proposed method certifies a less aggressive pacing policy than the point-forecast baseline, and reduces held-out any-violation rate from 16.7% to 3.3%, with zero budget and member-load violations. On KuaiRand, the choice remains unresolved. In a nutshell, the paper establishes that forecasts, response estimates, and member-experience models should be judged by whether they shrink the uncertainty that the pacing decision uses, as this leads to confident decisions that are not overly conservative.
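
The contrast between a decision-based score and a generic residual can be sketched with split conformal calibration. Everything named here is an assumption for illustration: the policy sensitivity vectors, the function names, and the use of an absolute value to cover both signs of each sensitivity. The abstract does not give the paper's exact score, so this is one plausible reading of "largest impact on deployable policies", not the authors' definition.

```python
import math

def decision_score(error, sensitivities):
    """Largest absolute effect of a forecast error vector on any candidate policy.

    Each entry of `sensitivities` is a hypothetical vector c for one deployable
    policy; taking |c . error| covers both +c and -c, i.e. the support function
    of the symmetric set {+c, -c} evaluated at the error.
    """
    return max(abs(sum(c * e for c, e in zip(sens, error)))
               for sens in sensitivities)

def euclidean_residual(error):
    """Generic residual score that ignores which dimensions matter to the decision."""
    return math.sqrt(sum(e * e for e in error))

def conformal_quantile(scores, alpha):
    """Split conformal radius: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

# Hypothetical calibration errors: the second coordinate is a nuisance dimension
# that no deployable policy is sensitive to.
policies = [[1.0, 0.0], [0.5, 0.0]]
calibration_errors = [[0.2, 90.0], [-0.1, -40.0], [0.3, 70.0]]

decision_radius = conformal_quantile(
    [decision_score(e, policies) for e in calibration_errors], alpha=0.5)
residual_radius = conformal_quantile(
    [euclidean_residual(e) for e in calibration_errors], alpha=0.5)
print(decision_radius, residual_radius)
```

In this toy setup the residual radius is dominated by the nuisance coordinate, while the decision-based radius only reflects error along directions that some policy actually uses.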

2606.10125 2026-06-10 stat.ML cs.DB cs.LG 新提交

Robust Active Learning for Few-Shot Example Selection in Text-to-SQL

鲁棒主动学习用于文本到SQL中的少样本示例选择

Arash Pourhabib

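AI summary: For few-shot example selection in text-to-SQL, proposes a robust active learning method in which a stratified greedy algorithm maximizes a heteroscedastic mutual information objective, with a constant-factor approximation guarantee on the embedding manifold and a significant reduction in annotation cost.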

详情
Comments
31 pages, 4 figures, 5 tables
AI中文摘要

少样本示例检索是将大型语言模型(LLM)应用于特定领域文本到SQL系统的主要范式。然而,标注示例库的质量直接决定系统准确性,且专家标注成本高昂。我们将这些示例的主动选择形式化为一个在语义查询嵌入的内在低维流形上的约束实验设计问题。与标准主动学习框架不同,我们的设置引入了三个关键挑战:依赖于查询的可变标注可靠性(异方差性)、跨语义主题的空间多样性严格要求(划分拟阵约束),以及嵌入空间真实协方差结构未知的固有现实(模型误设)。为了解决这些问题,我们提出了一种分层贪婪算法,该算法最大化异方差互信息目标。我们证明该目标在内在流形上保持子模性和近似单调性,从而得到理论上的常数因子近似保证。我们建立了一个谱界,表明当假设的替代核与真实数据生成过程存在偏差时,该近似保证会优雅地退化,而非灾难性地崩溃。实验结果表明,所提出的策略显著减少了标注工作量,同时保持了较高的文本到SQL检索准确性。

英文摘要

Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.
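
A stratified greedy selection of this kind can be sketched with a Gaussian-process surrogate. The code below is an illustration under stated assumptions rather than the paper's algorithm: the kernel matrix, the per-query noise variances (modeling annotation reliability), the topic labels, and the per-topic budgets are hypothetical inputs, and the marginal gain `0.5 * log(1 + posterior_variance / noise)` is the standard information gain of one noisy Gaussian observation.

```python
import math

def stratified_greedy(kernel, noise, strata, budgets):
    """Greedy selection under per-stratum budgets (a partition matroid).

    kernel  : n x n prior covariance between candidate queries (list of lists)
    noise   : per-candidate annotation noise variance (heteroscedastic)
    strata  : stratum (topic) label of each candidate
    budgets : dict mapping each stratum label to its maximum number of picks
    """
    n = len(kernel)
    cov = [row[:] for row in kernel]          # posterior covariance, updated in place
    used = {label: 0 for label in budgets}
    chosen = []
    while True:
        best, best_gain = None, 0.0
        for j in range(n):
            if j in chosen or used[strata[j]] >= budgets[strata[j]]:
                continue                      # already picked or stratum is full
            gain = 0.5 * math.log(1.0 + cov[j][j] / noise[j])
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            return chosen
        chosen.append(best)
        used[strata[best]] += 1
        # Condition the Gaussian on a noisy observation at `best` (rank-one update).
        denom = cov[best][best] + noise[best]
        col = [cov[i][best] for i in range(n)]
        cov = [[cov[i][k] - col[i] * col[k] / denom for k in range(n)]
               for i in range(n)]

kernel = [[1.0, 0.9, 0.0, 0.0],
          [0.9, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.9],
          [0.0, 0.0, 0.9, 1.0]]
noise = [0.1, 1.0, 0.1, 1.0]                  # candidates 0 and 2 are reliably labeled
strata = ["A", "A", "B", "B"]
print(stratified_greedy(kernel, noise, strata, {"A": 1, "B": 1}))
```

The budget check is what enforces diversity across topics: once a stratum is full, its remaining candidates are skipped even if their marginal gain is the largest available.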

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD 新提交

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

DeRA-MOS:通过解耦列表排序和模态对齐优化文本到音乐评估

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

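AI summary: Proposes DeRA-MOS, a decoupled optimization framework that uses a batch-aware listwise ranking loss for music impression and a score-anchored modality alignment loss for text alignment, substantially improving ranking metrics on MusicEval.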

详情
Comments
Accepted to IEEE Signal Processing Letters (SPL)
AI中文摘要

评估文本到音乐(TTM)系统仍然昂贵,因为音乐印象(MI)和文本对齐(TA)分数依赖于人类平均意见分数(MOS)。大多数自动MOS估计器采用逐点回归或分布分类训练。这些目标不直接优化基于排名的指标,并且为跨模态一致性提供较弱的几何约束。为了解决这些问题,我们提出了DeRA-MOS,一种用于TTM评估的解耦优化框架。对于MI,我们引入了一种批感知列表排序损失,该损失对每个小批量内的相对顺序进行建模,并更好地与基于Spearman秩相关系数(SRCC)的评估对齐。对于TA,我们引入了一种分数锚定的模态对齐损失,将人类分数映射到目标音频-文本相似度,并在融合前正则化潜在空间。通过有效缓解逐点训练不匹配和模态漂移,MusicEval上的实验表明,我们的解耦框架在MI和TA排名指标上均取得了显著改进,为大规模TTM评估建立了稳健的范式。

英文摘要

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
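
The difference between point-wise regression and a listwise objective can be illustrated with the classic ListNet top-one loss. This is a stand-in chosen for illustration, since the abstract does not specify the form of the batch-aware loss in DeRA-MOS, and the function names are hypothetical. The loss compares the softmax distribution of the human scores in a mini-batch with the softmax distribution of the predicted scores, so it depends on relative order within the batch rather than on absolute values.

```python
import math

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(values):
    """Numerically stable log-softmax."""
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]

def listwise_loss(predicted, human_scores):
    """ListNet top-one loss: cross-entropy between the two batch-level distributions."""
    target = softmax(human_scores)
    log_pred = log_softmax(predicted)
    return -sum(t * lp for t, lp in zip(target, log_pred))

human = [4.5, 3.0, 1.5]                       # hypothetical MOS values in one mini-batch
print(listwise_loss([2.0, 1.0, 0.0], human))  # predictions in the same order
print(listwise_loss([0.0, 1.0, 2.0], human))  # predictions in reversed order
```

Adding a constant to every prediction leaves this loss unchanged, which is the sense in which it targets ranking rather than calibration of individual scores.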

2606.09953 2026-06-10 eess.IV cs.AI cs.LG 新提交

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

深度切片插值用于减少头部CT的穿平面各向异性和噪声

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

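AI summary: Proposes a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing while denoising implicitly, and outperforms classical interpolation and video frame interpolation methods on structural measures.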

详情
AI中文摘要

头部计算机断层扫描(CT)通常使用亚毫米级的面内分辨率,但穿平面间距为2-5毫米,造成显著的各向异性,这会降低多平面重建、血肿体积估计等体积测量以及假设近似各向同性体素的后续算法的性能。我们提出一个深度学习系统,从相邻轴向切片对合成中间CT切片,将有效穿平面间距减半。该系统改善三维可视化,同时产生固有降噪的输出,在一次推理中实现两个互补优势。为构建可靠系统,我们系统评估像素级损失(均方误差MSE和平均绝对误差L1)、结构相似性损失(结构相似性指数SSIM及其多尺度变体MS-SSIM)以及混合组合。在保留测试集上,所有收敛模型在所有结构指标上均优于经典插值基线和预训练视频帧插值方法(RIFE、FILM),其中MS-SSIM+L1提供最强平衡性能。我们还记录了SSIM族损失中的训练不稳定性并识别部分补救措施:标准数值修复消除了主要失败模式,但在较小批量大小下留下残余发散。所有结果均报告患者级自助法置信区间和配对统计检验。作为示例,我们将系统应用于来自Virgen del Rocío大学医院的非分布头部CT序列:模型合成中间切片,并在真实切片上表现出我们理论分析预测的隐式降噪特征,支持在单个外部病例中插值质量和隐式降噪不局限于训练分布。

英文摘要

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.
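
The classical baseline that such a system is compared against can be sketched as linear interpolation between neighboring slices. This is a minimal illustration with a hypothetical function name and a toy volume stored as nested lists. It is not the paper's model. Inserting one averaged slice between each adjacent pair halves the through-plane spacing, and averaging two slices with independent noise of variance σ² yields noise variance σ²/2 in the synthesized slice, which gives some intuition for why interpolation and denoising can go together.

```python
def interpolate_slices(volume):
    """Insert the voxel-wise average between each pair of adjacent axial slices.

    volume : non-empty list of slices, each slice a list of rows of floats.
    Returns a new list with 2 * len(volume) - 1 slices.
    """
    out = []
    for lower, upper in zip(volume, volume[1:]):
        out.append(lower)
        midpoint = [[(a + b) / 2.0 for a, b in zip(row_lo, row_up)]
                    for row_lo, row_up in zip(lower, upper)]
        out.append(midpoint)
    out.append(volume[-1])
    return out

# Toy volume: three 1x2 slices, e.g. acquired at 4 mm spacing.
volume = [[[0.0, 2.0]], [[4.0, 6.0]], [[8.0, 10.0]]]
dense = interpolate_slices(volume)
print(len(dense))   # five slices, i.e. 2 mm effective spacing
print(dense[1])     # synthesized slice between the first two originals
```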

2606.09944 2026-06-10 econ.GN cs.AI q-fin.EC 新提交

GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

GAGI:一种用于分布感知宏观经济福利监测的基尼调整人均GDP指数

Sivasathivel Kandasamy

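AI summary: Proposes the GAGI index, which adjusts GDP per capita by the Gini coefficient and the price level to monitor distributional welfare effects. Applied to the G7 countries, it shows welfare growth diverging persistently from GDP growth.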

详情
AI中文摘要

人均GDP是政府机构追踪经济繁荣和经济事件后果的默认视角,但它忽视了生活繁荣的两个首要决定因素:收入/财富分配和通胀影响。不平等调整的收入衡量指标本身并不新鲜,但宏观经济监测工具包中具体缺失的不是福利概念,而是一个可操作的监测触发指标:一个足够简洁、可每年从公开数据计算、无需建模假设即可审计、且标准化以便于理解年度间和国家间变化(监管机构需要据此采取行动)的统计量。我们构建了这样一个工具,即基尼调整人均GDP指数(GAGI):一种可复现、可公开计算的公式,通过不平等调整因子(1-G)和价格水平重新调整各国人均GDP,并以2010年为基准标准化。GAGI是一个通用福利指数,并非特定于AI自动化,适用于任何需要追踪福利调整后繁荣的场景。将GAGI应用于2010-2026年的G7经济体,我们发现福利调整后的繁荣与总体GDP增长持续且日益偏离,这种偏离在2022年后急剧扩大,时间上与COVID后遗症和生成式AI部署加速相吻合,尽管仅凭此证据尚不能证明因果关系。我们认为GAGI是基于GDP监测的必要补充:任何仅追踪总产出的宏观经济监测工具都会系统性地忽略自动化可能造成的分配损害,即使报告的增长依然强劲。

英文摘要

GDP per capita is the default lens through which governing bodies track economic prosperity and the consequences of economic events, yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new, but what is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change (the quantity a regulator needs to act on) is legible. We assemble such an instrument, the Gini-Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country's GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, and that the divergence widens sharply after 2022, temporally coincident with, though not on this evidence alone demonstrated to be caused by, the after-effects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.
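
One plausible reading of the index definition can be sketched in a few lines. The abstract states only that GDP per capita is rescaled by (1-G) and the price level and normalised to 2010, so the exact form below is an assumption: nominal GDP per capita is multiplied by (1-G), divided by a price-level index, and scaled so that the base year equals 100. The function name and all input figures are hypothetical.

```python
def gagi(gdp_per_capita, gini, price_level, base_year=2010, scale=100.0):
    """Gini-adjusted, price-adjusted GDP per capita, indexed to base_year = scale.

    gdp_per_capita : dict year -> nominal GDP per capita
    gini           : dict year -> Gini coefficient G in [0, 1]
    price_level    : dict year -> price-level index (assumed deflator)
    """
    adjusted = {year: gdp_per_capita[year] * (1.0 - gini[year]) / price_level[year]
                for year in gdp_per_capita}
    base = adjusted[base_year]
    return {year: scale * value / base for year, value in adjusted.items()}

# Hypothetical country: nominal GDP per capita rises 25 percent, but inequality
# and prices rise as well.
index = gagi(
    gdp_per_capita={2010: 40000.0, 2020: 50000.0},
    gini={2010: 0.30, 2020: 0.35},
    price_level={2010: 1.00, 2020: 1.25},
)
print(index)
```

With these made-up inputs the index falls below its base value even though headline GDP per capita grows, which is the kind of divergence the index is meant to make visible.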

2606.09941 2026-06-10 stat.AP cs.LG stat.OT 新提交

Stochastic weather generators for high-frequency wind vector time series

高频风矢量时间序列的随机天气生成器

Mingshi Cui, Kevin Eng, Justin T. Greene, Zern Ke, Abolfazl Sodagartojgi, Zhiqiu Xia, Gemma E. Moran, Michael L. Stein

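AI summary: Develops machine learning models based on time vector-quantized variational autoencoders to generate realistic minute-scale wind vector time series. The models capture diurnal variation but fall short in matching the distribution of extreme wind speeds.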

详情
AI中文摘要

地表风速在分钟尺度上变化显著,因此有必要研究其在此精细时间尺度上的变化。为最小化季节性影响,本文限定于六月,基于俄克拉荷马州拉蒙特站点超过30年的分钟级高质量测量数据,开发了一系列用于生成真实地表风矢量时间序列的机器学习模型。此类生成器可作为多种学科模型的输入,特别是风能领域,同时也适用于野火蔓延和航空等。数据显示风速和风向均存在复杂的昼夜结构,标准时间序列模型难以捕捉,因此我们考虑多种机器学习方法,基于时间矢量量化变分自编码器构建随机风生成器。我们考虑一次生成一天的数据,以及基于前一天风况生成一天的风矢量。我们还研究了在生成器中纳入离散天气状态变量的方法。我们使用多种正式和非正式方法评估生成器。其中最佳生成器能够捕捉观测数据中的许多(但非全部)复杂特征。特别地,我们的最佳方法准确模拟了风波动性的昼夜变化,但在匹配观测到的极端风速分布方面存在困难。

英文摘要

Surface winds can vary substantially from one minute to the next, so there is scope for studying their variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.

2606.09893 2026-06-10 eess.IV cs.AI cs.LG 新提交

Tractogram foundation model

TractFM:纤维束图基础模型

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

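AI summary: Proposes TractFM, a foundation model that learns reusable representations directly from whole-brain streamline sets by combining a local streamline encoder with a permutation-equivariant tractogram encoder. It is pretrained on dense anatomical tract parcellation and transfers to streamline-level and subject-level tasks.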

详情
AI中文摘要

扩散MRI(dMRI)纤维束成像是在活体人脑中绘制白质通路的唯一非侵入性方法。它将每个大脑表示为一个纤维束图:一个大型、无序的三维流线集合,包含局部流线几何和全脑解剖组织的信息。这种结构使纤维束图成为表示学习的自然但具有挑战性的目标。现有方法将流线分类和受试者级预测视为独立问题:流线分类器关注几何模式,而受试者级预测通常依赖于手工特征。因此,当前方法无法学习连接流线解剖与全脑受试者间变异的可复用表示。本文介绍TractFM,一个纤维束图基础模型,直接从全脑纤维束集学习可复用表示。TractFM结合了局部流线编码器和置换等变纤维束编码器,使得一个受试者的所有流线能够在单次前向传递中共同上下文化。在密集解剖束分割(即给单个流线分配解剖标签)上的预训练产生了两种互补表示:用于束分割的上下文化流线级嵌入和用于下游受试者表型预测的紧凑受试者级描述符。在三种纤维束成像算法和五个dMRI数据集上,TractFM迁移到流线级和受试者级任务。其冻结表示实现了准确的束分割,并在独立数据集上预测年龄和性别。这些结果表明,全脑几何上下文(一次性学习)可以泛化到纤维束成像流程、数据集和预测任务中。

英文摘要

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

2606.11190 2026-06-10 cs.LG 新提交

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

何时对齐,何时预测:多模态学习的相图

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

Institutions * Technion; Genentech; Brown University; Meta AI, FAIR

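AI summary: Proposes a unified linear framework whose signal-plus-noise model exposes complementary failure modes of cross-modal alignment and prediction, builds a four-regime phase diagram to guide the choice of multimodal learning objective, and validates it in nonlinear experiments.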

详情
AI中文摘要

跨模态对齐(CA)和跨模态预测(CP)是多模态表示学习的主要范式,但目前缺乏对每种方法何时成功、何时失败以及跨模态训练何时有帮助的系统性理解——这一空白使得从业者,特别是在生物医学或天体物理学等科学领域,面对异构仪器以及多个层次的组织和测量时,无法诊断为什么标准方法不如最佳单模态。我们开发了一个统一的线性框架来解决这两个问题。在具有结构化跨模态干扰相关性的尖峰信号加噪声模型下,我们推导出两个目标的分离比,揭示了互补的失效模式:对齐使每个模态白化,当干扰在视图间强相关时失败;预测通过单侧白化编码任何可跨模态预测的内容,恢复由源模态质量决定。由此产生的相图将多模态问题划分为四个区域:两者、仅CA、仅CP和两者都不。我们提出了一种数据驱动的方法,使用少量标记子样本将真实世界数据集定位在该图中,在任何跨模态训练之前确定首选目标和预测方向。在合成数据、立体视觉基准、图像-文本对和真实天体物理数据上的实验验证了非线性情况下的预测,包括跨模态训练有害的“两者都不”区域。我们的框架使从业者能够诊断其多模态问题,并在投入训练之前选择正确的目标。重现结果的代码可在此https URL获取。

英文摘要

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.

2606.11189 2026-06-10 cs.LG cs.AI cs.CL 新提交

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

通过目标分布设计审视监督微调的统一视角

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

Institutions * University of California, Los Angeles (UCLA); Arena

AI总结 本文重新解读监督微调为目标分布设计,提出Q-target框架,将监督分解为对观测token的依赖强度与替代token的概率分配,并基于此提出Target-SFT方法,在多个推理任务中优于现有方法。

详情
AI中文摘要

监督微调(SFT)通常最大化示范轨迹中每个token的似然。然而,观测到的token可能非唯一、有噪声或与模型先验不一致。严格拟合这种one-hot目标可能不是最优的,尤其是当预训练模型编码了丰富的知识先验时。在这项工作中,我们将SFT重新解释为目标分布设计:不仅研究损失目标,还分析损失驱动模型匹配的token级目标。我们引入Q-target框架,将SFT监督分解为两个明确的选择:(1) 对观测token的依赖强度,以及(2) 如何将剩余概率质量分配给替代token。这一视角将许多现有的SFT变体统一为目标分布Q的隐式选择。基于这一观点,我们提出Target-SFT,直接从期望的目标分布构建训练目标。该方法在十个推理数据集-模型设置中一致优于现有方法,展示了这种基于目标的方法的有效性。总体而言,我们的公式揭示了SFT训练更基本的设计原则,并为SFT目标开辟了更广阔的搜索空间。

英文摘要

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms existing methods across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
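
The two design choices can be made concrete with a small sketch. The parameterization below is an assumption for illustration and is not taken from the paper: `alpha` is the mass placed on the observed token, and `alt_weights` decides how the remaining `1 - alpha` is shared among the other tokens. Uniform weights recover ordinary label smoothing, `alpha = 1` recovers the one-hot target of standard SFT, and the function names are hypothetical.

```python
import math

def build_q_target(vocab_size, observed, alpha, alt_weights=None):
    """Token-level target: alpha on the observed token, the rest on alternatives.

    alt_weights : optional non-negative weights over the vocabulary; the entry at
                  `observed` is ignored. Defaults to uniform over alternatives.
    """
    if alt_weights is None:
        alt_weights = [1.0] * vocab_size
    alt_total = sum(w for v, w in enumerate(alt_weights) if v != observed)
    target = []
    for v in range(vocab_size):
        if v == observed:
            target.append(alpha)
        else:
            target.append((1.0 - alpha) * alt_weights[v] / alt_total)
    return target

def cross_entropy(target, model_probs):
    """Loss that drives the model distribution toward the target distribution."""
    return -sum(q * math.log(p) for q, p in zip(target, model_probs) if q > 0.0)

model_probs = [0.1, 0.2, 0.6, 0.1]          # hypothetical next-token distribution
one_hot = build_q_target(4, observed=2, alpha=1.0)
smoothed = build_q_target(4, observed=2, alpha=0.7)
prior_shaped = build_q_target(4, observed=2, alpha=0.7, alt_weights=model_probs)

print(cross_entropy(one_hot, model_probs))
print(cross_entropy(smoothed, model_probs))
print(cross_entropy(prior_shaped, model_probs))
```

Passing the model's own probabilities as `alt_weights` is one way to let the pretrained prior decide where the leftover mass goes. Whether Target-SFT does this is not stated in the abstract.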