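arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.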

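2411.08821 2026-06-17 stat.ML cs.LG stat.CO Updated version

Conditional Local Importance by Quantile Expectations

Kelvyn K. Bladen, Adele Cutler, D. Richard Cutler, Kevin R. Moon

AI summary: Proposes CLIQUE, a model-agnostic local variable importance method that captures locally dependent relationships through quantile expectations, improves stability, and applies directly to multi-class classification.

Comments: 29 pages, 28 figures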

Journal ref: Transactions on Machine Learning Research (2026)

Abstract

Global variable importance measures are commonly used to interpret the results of machine learning models. Local variable importance techniques assess how variables contribute to individual observations. Current popular methods, including LIME and SHAP, provide useful measures of feature contribution in the prediction space, while leaving opportunities for improved characterization of local structure in the model loss space. Additionally, they are not natively adapted for multi-class classification problems. We propose a new model-agnostic method for calculating local variable importance, CLIQUE, that highlights locally dependent relationships, provides improved stability over permutation-based methods, and can be directly applied to multi-class classification problems. Simulated and real-world examples show that CLIQUE emphasizes locally dependent information, captures interaction behavior beyond what can be evaluated by correlations, and assigns zero importance in regions where the response is invariant to changes in variables.

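2603.04198 2026-06-17 stat.ML cs.LG Updated version

Stable and Steerable Sparse Autoencoders with Weight Regularization

Piotr Jedryszek, Oliver M. Crook

AI summary: L1/L2 weight regularization improves the cross-seed feature consistency of sparse autoencoders and raises steering success rates on a language model while preserving interpretability scores.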

Abstract

Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.
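
To make the regularized objective concrete, here is a minimal pure-Python sketch of a TopK SAE loss with an added L2 penalty on encoder and decoder weights. It is an illustration under stated assumptions, not the authors' training code: the helper names (`matvec`, `topk`, `squared_norm`, `sae_loss`), the toy dimensions, the coefficient `l2_coeff`, and the choice to apply TopK directly to the pre-activations are all hypothetical.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def topk(values, k):
    """Keep the k largest entries and zero out the rest (TopK sparsity)."""
    keep = set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def squared_norm(matrix):
    """Squared Frobenius norm, i.e. the sum of squared weights."""
    return sum(w * w for row in matrix for w in row)


def sae_loss(x, w_enc, w_dec, k, l2_coeff):
    """Return (reconstruction error, L2 weight penalty, total loss) for one input."""
    latent = topk(matvec(w_enc, x), k)   # sparse code, one entry per latent
    x_hat = matvec(w_dec, latent)        # reconstruction, one entry per input
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    penalty = l2_coeff * (squared_norm(w_enc) + squared_norm(w_dec))
    return recon, penalty, recon + penalty


# Toy usage with hypothetical sizes: 4 inputs, 6 latents, 2 active latents.
rng = random.Random(0)
w_enc = [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(6)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(6)] for _ in range(4)]
x = [1.0, -0.5, 0.25, 2.0]
recon, penalty, total = sae_loss(x, w_enc, w_dec, k=2, l2_coeff=1e-3)
print(f"reconstruction={recon:.4f} penalty={penalty:.6f} total={total:.4f}")
```

Setting `l2_coeff` to zero recovers the unregularized TopK objective, so the penalty can be toggled without touching the rest of the loss.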

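2603.02159 2026-06-17 stat.ML cs.LG Updated version

Beyond Independent Genes: Learning Module-Inductive Representations for Single-Cell Gene Perturbation Prediction

Jiafa Ruan, Ruijie Quan, Liyang Xu, Zongxin Yang, Yi Yang

AI summary: Proposes the scBIG framework, which learns coordinated gene-program module representations through gene-relation clustering, a gene-cluster-aware encoder, and structure-aware alignment, and combines them with conditional flow matching for flexible, generalizable perturbation prediction, improving on multiple single-cell perturbation benchmarks by 6.7% on average.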

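Questioning the Coverage-Length Metric in Conformal Prediction: When Shorter Intervals Are Not Better

Yizhou Min, Yizhou Lu, Lanqi Li, Zhen Zhang, Jiaye Teng

AI summary: Critically examines whether the standard conformal prediction metrics, coverage and interval length, are sufficient; shows that a counter-intuitive method called the Prejudicial Trick (PT) can deceptively shorten intervals while keeping coverage valid; and proposes a new metric, interval stability, to detect such behavior.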

Abstract

Conformal prediction (CP) has become a cornerstone of distribution-free uncertainty quantification, conventionally evaluated by its coverage and interval length. This work critically examines the sufficiency of these standard metrics. We demonstrate that the interval length might be deceptively improved through a counter-intuitive approach termed Prejudicial Trick (PT), while the coverage remains valid. Specifically, for any given test sample, PT probabilistically returns an interval, which is either null or constructed using an adjusted confidence level, thereby preserving marginal coverage. While PT potentially yields a deceptively lower interval length, it introduces practical vulnerabilities: the same input can yield completely different prediction intervals across repeated runs of the algorithm. We formally derive the conditions under which PT achieves these misleading improvements and provide extensive empirical evidence across various regression and classification tasks. Furthermore, we introduce a new metric, interval stability, which helps detect whether a new CP method implicitly improves the length based on such PT-like techniques. Code is available at https://github.com/benben-cd/PT-Conformal-Prediction.
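
The randomization described in the abstract can be sketched in a few lines. The simulation below is one possible reading, not the authors' implementation: with probability `p_null` it returns an empty set, and otherwise it builds a split-conformal interval at the adjusted level `(1 - alpha) / (1 - p_null)`, so that marginal coverage stays near `1 - alpha`. The function names, the parameter values, and the residual distribution are assumptions; the residual distribution in particular was picked so that the trick shortens the average length, which does not hold for every distribution (the abstract refers to formal conditions for this).

```python
import math
import random


def conformal_quantile(scores, level):
    """Split-conformal quantile: the ceil((n + 1) * level)-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * level)
    return float("inf") if rank > n else sorted(scores)[rank - 1]


def simulate(alpha=0.1, p_null=0.05, n_cal=5000, n_test=20000, seed=0):
    """Compare standard split conformal with the randomized PT-style variant."""
    rng = random.Random(seed)

    def abs_error():
        # Hypothetical residual magnitude |y - f(x)| with CDF s**2 on [0, 1].
        return math.sqrt(rng.random())

    scores = [abs_error() for _ in range(n_cal)]
    q_std = conformal_quantile(scores, 1 - alpha)
    # Adjusted level: (1 - p_null) * adjusted = 1 - alpha (needs p_null < alpha).
    q_pt = conformal_quantile(scores, (1 - alpha) / (1 - p_null))

    covered_std = covered_pt = 0
    total_len_pt = 0.0
    for _ in range(n_test):
        err = abs_error()
        covered_std += err <= q_std          # interval is [f(x) - q, f(x) + q]
        if rng.random() >= p_null:           # otherwise PT returns the empty set
            covered_pt += err <= q_pt
            total_len_pt += 2 * q_pt
    return {
        "coverage_standard": covered_std / n_test,
        "coverage_pt": covered_pt / n_test,
        "length_standard": 2 * q_std,
        "length_pt": total_len_pt / n_test,  # empty sets count as length 0
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting both variants cover at roughly the nominal rate, yet the randomized variant reports a smaller mean length only because some inputs receive no interval at all, which is the instability the proposed interval stability metric is meant to expose.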

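2601.06862 2026-06-17 cs.CR cs.CV cs.LG cs.MM eess.IV Updated version

Robust Local Polynomial Regression with Similarity Kernels

Yaniv Shulman

AI summary: To address the sensitivity of traditional local polynomial regression to outliers, proposes a conditional-density kernel weighting scheme that incorporates response-variable information and reduces the influence of outliers through local density estimation, lowering empirical bias while remaining competitive with standard LOWESS.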

Abstract

Local Polynomial Regression (LPR) is a widely used nonparametric method for modeling complex relationships due to its flexibility and simplicity. It estimates a regression function by fitting low-degree polynomials to localized subsets of the data, weighted by proximity. However, traditional LPR is sensitive to outliers and high-leverage points, which can significantly affect estimation accuracy. This paper revisits the kernel function used to compute regression weights and proposes a novel framework that incorporates both predictor and response variables in the weighting mechanism. The focus of this work is a conditional density kernel that robustly estimates weights by mitigating the influence of outliers through localized density estimation. The proposed method is implemented in Python and is publicly available at https://github.com/yaniv-shulman/rsklpr. The population analysis quantifies the bias induced by density-based robust weighting, and the reported experiments show lower empirical bias than iterative robust LOWESS while remaining competitive with standard LOWESS. This advancement provides a promising extension to traditional LPR, opening new possibilities for robust regression applications.
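
For readers unfamiliar with the baseline, the sketch below shows conventional local linear regression, where weights depend only on proximity in the predictor. It is not the conditional density kernel proposed in the paper; the function name `local_linear`, the Gaussian kernel, the bandwidth, and the toy data are assumptions chosen to show why a single outlier in the response can pull the fit.

```python
import math


def local_linear(x0, xs, ys, bandwidth):
    """Local linear estimate at x0 using Gaussian proximity weights."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / bandwidth) ** 2)  # weight depends on x only
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    # Intercept of the weighted least-squares line, i.e. the fitted value at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Toy data on a line y = 2x + 1, then the same data with one corrupted response.
xs = [i / 20 for i in range(21)]
clean = [2 * x + 1 for x in xs]
dirty = list(clean)
dirty[10] += 10.0                                   # outlier at x = 0.5

fit_clean = local_linear(0.5, xs, clean, bandwidth=0.1)
fit_dirty = local_linear(0.5, xs, dirty, bandwidth=0.1)
print(f"clean fit at 0.5: {fit_clean:.3f}")
print(f"fit with one outlier at 0.5: {fit_dirty:.3f}")
```

Because the weights ignore the response, the outlier receives the largest weight at its own location; a response-aware kernel of the kind the abstract describes is one way to downweight such points.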

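2507.05164 2026-06-17 math.DS cs.LG nlin.AO Updated version

A Dynamical Systems Perspective on the Analysis of Neural Networks

Dennis Chemnitz, Maximilian Engel, Christian Kuehn, Sara-Viola Kuntz

AI summary: Uses dynamical systems to reformulate challenges in deep neural networks and gradient descent, studying information propagation, training dynamics, and mean-field limits, and covering properties such as network embedding, stability, and graph limits.

Comments: preprint of a book chapter contribution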

Abstract

In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.
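
The dynamical view of gradient descent can be illustrated with the simplest scalar case, which is an assumption made here for exposition and not the chapter's general setting. On f(x) = 0.5 * curvature * x**2, one gradient step is the linear map x -> (1 - step_size * curvature) * x, so the fixed point at 0 is stable exactly when the step size is below 2 / curvature. The function name and parameter values are hypothetical.

```python
def gradient_descent(curvature, step_size, x0=1.0, n_steps=100):
    """Iterate x <- x - step_size * f'(x) for f(x) = 0.5 * curvature * x**2."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * curvature * x  # equals x * (1 - step_size * curvature)
    return x


curvature = 4.0                            # stability threshold is 2 / curvature
for step_size in (0.4, 0.5, 0.6):
    final = gradient_descent(curvature, step_size)
    multiplier = 1 - step_size * curvature
    print(f"step={step_size}: multiplier={multiplier:+.1f}, |x_100|={abs(final):.3e}")
```

The three step sizes sit below, at, and above the threshold: the iterates contract, oscillate with constant magnitude, and diverge, respectively.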

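2410.08562 2026-06-17 cond-mat.mtrl-sci cs.LG Updated version

Adaptable Method for Crystal Design across Diverse Constraints and Objectives with Pretrained Property Predictors

Akihiro Fujii, Yoshitaka Ushiku, Koji Shimizu, Anh Khoa Augustin Lu, Satoshi Watanabe

AI summary: Proposes direct predictor-guided gradient optimization that combines off-the-shelf predictors with site-wise element masks, template initialization, and task-specific losses for data-efficient, constraint-rich crystal design; it outperforms generative and Bayesian baselines on perovskites and also supports half-metal design.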

Abstract

Advanced crystal design can accelerate materials discovery across applications from photovoltaics to spintronics. Practical design must satisfy multiple properties and physical constraints, yet existing machine-learning-based approaches to such design often depend on large datasets, retraining, or task-specific generators. Here, we show that direct predictor-guided gradient optimization enables data-efficient, constraint-rich crystal design by combining off-the-shelf predictors with site-wise element masks, template initialization, and task-specific losses. In perovskites, it outperformed generative and Bayesian baselines under three targets -- band gap, formation energy, and tolerance factor -- and two hard constraints. DFT assessment further showed band-gap targeting competitive with a leading generative model despite using predictors trained on roughly one-tenth of the data. By flexibly combining pretrained predictors with application-oriented masks and custom losses, the same framework supported half-metal design. Such modularity could help researchers and engineers translate diverse application requirements directly into optimized candidate crystals with minimal computational cost.

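2405.15379 2026-06-17 stat.ML cs.LG math.PR math.ST stat.TH Updated version

Automated Evaluation of Standardized Dementia Screening Tests

Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Thomas Hillemacher, Hartmut Lehfeld, Korbinian Riedhammer

AI summary: Studies automated scoring of standardized dementia screening tests by analyzing how scores from manual and automatic transcripts correlate with expert scores; automatic scoring is stricter than human scoring on some tasks, but correlations remain high overall.

Comments: Submitted to Interspeech 2022. arXiv admin note: text overlap with arXiv:2206.05018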

DOI: 10.21437/Interspeech.2022-10436
Journal ref: Proceedings of Interspeech 2022

Abstract

For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SKT and the CERAD-NB. The tests include basic tasks such as naming objects and learning word lists, but also widely used tools such as the MMSE. Most of the tasks are performed verbally and should thus be suitable for automated scoring based on transcripts. For the first batch of 30 patients, we analyze the correlation between expert manual evaluations and automatic evaluations based on manual and automatic transcriptions. For both SKT and CERAD-NB, we observe high to perfect correlations using manual transcripts; for certain tasks with lower correlation, the automatic scoring is stricter than the human reference since it is limited to the audio. Using automatic transcriptions, correlations drop as expected and are related to recognition accuracy; however, we still observe high correlations of up to 0.98 (SKT) and 0.85 (CERAD-NB). We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.

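2606.17977 2026-06-17 econ.EM New submission

Beyond Parallel Trends in Staggered Difference-in-Differences: Identification under Higher-Order Parallelism

Zecharias Anteneh

AI summary: Proposes a hierarchy of higher-order parallelism assumptions to replace the conventional parallel trends assumption, achieving point identification of cohort-specific and aggregate treatment effects in staggered difference-in-differences designs, and proves an aggregation theorem.

Comments: 38 pages, 4 figures. Companion Stata command (anddp) implementing the estimator will be available soon at https://github.com/zanteneh/anddp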

Abstract

In difference-in-differences designs, the parallel trends assumption requires that the outcome gap between treated and control units would have remained flat absent treatment. Pre-treatment event studies frequently reject this flat-gap requirement. Existing responses include parametric trend controls and bounds on the treatment effect under assumptions about the magnitude of the violation. This paper shows that point identification of cohort-specific and aggregate treatment effects in staggered designs remains achievable under strictly weaker assumptions. I replace the flat-gap requirement with a hierarchy of higher-order conditions, Parallel[p], embed this framework in the group-time average treatment effect structure of Callaway and Sant'Anna (2021), and prove an aggregation theorem for the case where different cohorts are identified under different feasible polynomial orders, a challenge unique to staggered designs that has not been previously addressed. A sequential order-selection procedure guides applied practice. Monte Carlo evidence confirms that post-selection bootstrap coverage remains near-nominal and that inference is robust to realistic serial correlation. Applied to Medicaid expansion data, the method yields point estimates resting on an assumption the pre-treatment data do not reject, in contrast to the flat-gap requirement, which those same data decisively reject.

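2606.18196 2026-06-17 eess.SP New submission

Receiver-Aware Analysis and Verification of the Spectral Separation Coefficient Under Interference-Induced Degradation

Lucas Heublein, Fabian Benschuh, Alexander Rügamer, Felix Ott

AI summary: Computes a receiver-dependent spectral separation coefficient (SSC) by incorporating receiver front-end characteristics, and experimentally verifies the robustness of the interference impact computation on real-world and simulated datasets.

Comments: 7 pages, 4 figures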

Abstract

Interference poses a significant challenge to satellite-based positioning systems, making it essential to accurately quantify the effects of specific interference types on receiver performance and the resulting reliability of position computation. In current practice, interference effects are often quantified using receiver-independent metrics, with receiver-specific front-end characteristics either idealized or only implicitly considered. In this paper, we address this limitation by explicitly incorporating receiver-specific front-end characteristics into the computation of interference effects and validating the resulting receiver-dependent analysis experimentally. To this end, we record a real-world open-field dataset comprising 210 distinct interference scenarios and compute the receiver-dependent spectral separation coefficient (SSC) and interference impact for a specific receiver module. Furthermore, we verify the computation using a controlled dataset generated with a radio frequency constellation simulator (RFCS), employing the same receiver module and replaying similar interference classes. The comparison of results obtained in both environments demonstrates the robustness of the interference impact computation.
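
For orientation, the commonly used receiver-independent SSC can be written as the overlap of the normalized power spectral densities of the desired signal and the interference within the front-end bandwidth. The last line below is only one possible receiver-dependent variant, stated as an assumption: it replaces the ideal brick-wall filter with a front-end power response $|H(f)|^2$. The abstract does not give the paper's exact formulation, so the symbols $G_s$, $G_i$, $B$, and $H$ are notation chosen here.

```latex
% Normalized power spectral densities of the desired signal and the interference
\int_{-\infty}^{\infty} G_s(f)\,\mathrm{d}f = 1, \qquad
\int_{-\infty}^{\infty} G_i(f)\,\mathrm{d}f = 1

% Receiver-independent SSC over an ideal front-end bandwidth B
\kappa_{is} = \int_{-B/2}^{B/2} G_i(f)\, G_s(f)\,\mathrm{d}f

% One possible receiver-dependent variant (assumption): weight the overlap by a
% front-end power response |H(f)|^2 instead of an ideal brick-wall filter
\kappa_{is}^{\mathrm{rx}} = \int_{-\infty}^{\infty} |H(f)|^2\, G_i(f)\, G_s(f)\,\mathrm{d}f
```

The ideal case is recovered when $|H(f)|^2$ equals 1 inside the band and 0 outside it.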

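2606.18134 2026-06-17 eess.AS New submission

Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning

Alexander Polok, Samuele Cornell, Sathvik Udupa, Jan Černocký, Shinji Watanabe, Lukáš Burget

AI summary: Proposes diarization-conditioned spoken language models that condition the acoustic encoder to extract target-speaker representations, avoiding the catastrophic forgetting risked by Serialized Output Training, and substantially improving speaker-attributed transcription on multiple datasets.

Comments: Accepted to Interspeech 2026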

Abstract

We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.

2606.18072 2026-06-17 eess.AS New submission

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

Zheqi Dai, Guangyan Zhang, Zhen Ye, Jingyu Li, Haolin He, Chunyat Wu, Yiwen Guo, Qiuqiang Kong

AI summary: Proposes applying MeanFlow in a highly compressed latent space for one-step Token2Wav generation, resolving the speed-quality trade-off of multi-step flow-matching decoders with up to a 17x RTF improvement and negligible quality loss.

Comments: 5 pages, 1 figure

Abstract

Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.
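
The difference between average and instantaneous velocity can be stated compactly. The equations below follow the general MeanFlow formulation as commonly presented, written here as a sketch: the time convention (noise at $t = 1$, data latent at $t = 0$) and the conditioning symbol $c$ for the input tokens are notational assumptions, and the abstract does not specify the paper's exact parameterization.

```latex
% Average velocity over [r, t] along a flow path z_\tau with instantaneous velocity v
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\,\mathrm{d}\tau

% Since dz_\tau / d\tau = v(z_\tau, \tau), the displacement over the interval is
z_r = z_t - (t - r)\, u(z_t, r, t)

% One-step generation from noise z_1 to a latent z_0 (r = 0, t = 1),
% with a learned network u_\theta conditioned on tokens c
z_0 = z_1 - u_\theta(z_1, 0, 1 \mid c)
```

A multi-step flow-matching decoder instead integrates $v$ numerically over many small intervals, which is where the iterative sampling cost mentioned in the abstract comes from.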
