A Functional Data Framework For Analyzing Shapes and Textures in Images
Issam-Ali Moindjié
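AI summary: Proposes an image representation for star-shaped domains based on functional data analysis that reduces dimensionality and computational cost, and applies it to supervised classification.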
Images represent objects characterized by contours and textures. From a statistical perspective, these features can be defined as observations of continuous random functions. However, most existing approaches rely on pixel-based discretizations, which lead to high-dimensional representations and heavy computational costs. In this note, we introduce an alternative, more frugal representation. This representation assumes that the interior of the object is a star-shaped domain. Under this condition, we explore the analysis of images from a functional data analysis perspective. The proposed framework is illustrated on a real data supervised image classification problem.
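To make the star-shaped assumption concrete: a star-shaped domain can be described by a radial function r(theta) measured from a center point, so the contour becomes a one-dimensional periodic function rather than a pixel grid. The sketch below is only an illustration of that idea, not the authors' implementation; the function names `radial_contour` and `enclosed_area`, the equally spaced angular grid, and the choice of center are all assumptions.

```python
import numpy as np

def radial_contour(r_values, center=(0.0, 0.0)):
    """Boundary points of a star-shaped domain from samples of r(theta).

    r_values holds r(theta) on an equally spaced grid over [0, 2*pi)
    (hypothetical sampling scheme, chosen for illustration).
    """
    r = np.asarray(r_values, dtype=float)
    theta = np.linspace(0.0, 2.0 * np.pi, r.size, endpoint=False)
    x = center[0] + r * np.cos(theta)
    y = center[1] + r * np.sin(theta)
    return theta, np.column_stack([x, y])

def enclosed_area(r_values):
    """Area = 0.5 * integral of r(theta)^2 d(theta), as a Riemann sum."""
    r = np.asarray(r_values, dtype=float)
    return 0.5 * np.sum(r ** 2) * (2.0 * np.pi / r.size)

# A circle of radius 2 is the simplest star-shaped example.
r_circle = np.full(360, 2.0)
theta, boundary = radial_contour(r_circle)
print(boundary.shape, enclosed_area(r_circle))
```

With this encoding, 360 numbers describe the whole contour, whereas a pixel mask of the same object would need one value per pixel.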
Two-Sample Homogeneity Testing Based on Entropy-Regularized Optimal Transport
Yiming Ma, Hang Liu, Weiwei Zhuang
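AI summary: Proposes a two-sample homogeneity test based on entropy-regularized optimal transport maps, with the squared L2 distance as the test statistic; proves identifiability, a central limit theorem, and local asymptotic power, and calibrates the null distribution with a weighted multiplier bootstrap.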
This paper proposes a two-sample homogeneity test based on entropic optimal transport (EOT) maps from a common reference distribution -- the uniform law on the unit ball. The test statistic is the squared $L^2$-distance between the two empirical EOT maps. For fixed entropic regularization parameter, we prove that the population map discrepancy is identifiable, derive a functional central limit theorem for the empirical map difference under the null, and establish the Gaussian quadratic-form null limit. We also prove consistency against fixed alternatives and characterize local asymptotic power under contiguous alternatives. A weighted multiplier bootstrap is proposed to calibrate the non-pivotal null distribution, and its validity is established. Extensive simulations demonstrate that the proposed EOT-map test has reliable finite-sample size control and exhibits competitive power compared with other existing methods. The method is particularly powerful for location alternatives and, beyond a single scalar discrepancy, it provides additional diagnostic information on how the two distributions differ. Finally, a real data application concludes the paper.
Predicting Current Outcomes from Historical Survey Data with Weighted Conformal Prediction
Chihoon Lee, Sungkyu Jung, Hyokyung G. Hong
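AI summary: To handle outcomes that large-scale surveys measure only in certain years, proposes a weighted conformal prediction framework that estimates the likelihood ratio between historical and target covariate distributions, giving valid population-level prediction with coverage guarantees.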
In large-scale complex surveys such as the National Health and Nutrition Examination Survey (NHANES), some outcomes are measured only in selected years, leaving incomplete records across survey waves. We develop a weighted conformal prediction framework that enables valid population-level prediction of unobserved outcomes using information from earlier surveys. The method accommodates covariate shift, where both continuous and categorical covariate distributions evolve over time while survey design affects representativeness. It integrates subgroup-specific density ratio and subgroup-proportion estimation to approximate likelihood ratios between the historical and target covariate distributions, and we establish coverage guarantees for the resulting prediction sets. Simulation studies and an application predicting low-density lipoprotein cholesterol (LDL-C) for the current U.S. population show that the proposed approach achieves coverage close to the nominal level and improved efficiency over existing methods, particularly when covariate distributions are complex or unknown.
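The core computation in weighted conformal prediction under covariate shift is a weighted quantile of calibration scores, with weights given by likelihood ratios between target and historical covariate distributions. The sketch below shows that generic step only. It assumes the likelihood ratios are already estimated, and it does not reproduce the paper's subgroup-specific density ratio estimation or its survey-design adjustments; `weighted_conformal_quantile` and its arguments are hypothetical names.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """(1 - alpha) quantile of the weighted score distribution.

    scores      : nonconformity scores on historical (calibration) data
    weights     : estimated likelihood ratios at the calibration covariates
    test_weight : estimated likelihood ratio at the target covariate
    The target point carries a point mass at +infinity, so the result can
    be infinite when the calibration mass is too small.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum() + test_weight
    order = np.argsort(scores)
    cumulative = np.cumsum(weights[order]) / total
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    if idx >= scores.size:
        return np.inf
    return float(scores[order][idx])

# Absolute residuals as scores; the prediction set is [y_hat - q, y_hat + q].
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q_equal = weighted_conformal_quantile(scores, np.ones(5), 1.0, alpha=0.25)
q_shift = weighted_conformal_quantile(scores, [10, 1, 1, 1, 1], 1.0, alpha=0.25)
print(q_equal, q_shift)
```

With equal weights the function reduces to the usual split conformal quantile; unequal weights move the quantile toward the scores of calibration points that resemble the target population.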
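Methods for Adjusting for Covariate Measurement Error in Flexible Modelling of Functional Form: Results of a Blinded, Controlled Neutral Comparison Simulation Study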
Mohammed Sedki, Aris Perperoglou, Anne C. M. Thiébaut, Steve Ferreira Guerra, Paul Gustafson, Frank E. Harrell, Willi Sauerbrei, Michal Abrahamowicz, Laurence S. Freedman
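AI summary: A blinded, multi-stage neutral comparison simulation study evaluates six families of measurement error correction methods combined with four flexible regression models for estimating non-linear associations; pointwise SIMEX is the most accurate and robust, followed by Bayesian methods and regression calibration, while multiple imputation does less well and B-splines do worst.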
Covariate measurement error is pervasive in epidemiological research and distorts estimated exposure-outcome associations, yet correction methods have been studied almost exclusively under linear modelling assumptions. Their behaviour when the underlying association is non-linear and is itself estimated with flexible regression remains poorly characterised. We report a blinded, multi-stage neutral comparison simulation study, conducted within the STRATOS initiative, evaluating measurement error correction coupled with flexible modelling of functional form. Six families of correction methods (pointwise and coefficient-based Simulation Extrapolation [SIMEX], Bayesian inference on the logit and risk scales, Multiple Imputation [MI], and Regression Calibration [RC]) were each combined with B-splines (BS), penalised splines (PS), fractional polynomials (FP), and natural splines (NS), yielding 23 analytic methods. Methods were applied to case-control data generated under five functional forms (J-shape, linear, two threshold models, and saturation) across simulated datasets spanning varying sample sizes, replication substudy sizes, error magnitudes, and error distributions, with classical additive error and a replication substudy for error calibration. Performance was assessed by the log mean squared error of the estimated function over the central 95% of the exposure distribution. Pointwise SIMEX was the most accurate and most robust approach overall, followed by Bayesian methods and RC when paired with PS, FP, or NS; MI performed less well, and Bayesian estimation with unpenalised BS performed worst. PS, FP, and NS were near-equivalent, whereas BS was consistently inferior. No single method dominated across all scenarios, underscoring the value of sensitivity analyses.
Wasserstein Barycenter Estimation for One-Dimensional Distributions under Sparse Sampling
James Peng, Florian Stijven, Linbo Wang, Peter B. Gilbert
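AI summary: For data in which each unit's one-dimensional distribution is observed only through a small i.i.d. sample, proposes the marginal-constructed barycenter (MCB) estimator, which estimates the distribution of latent quantiles with a binomial mixture method, overcomes the bias of the empirical Wasserstein barycenter under sparse sampling, and is shown to be consistent and asymptotically normal.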
We study distributional data under sparse sampling, where each unit is represented by a probability distribution on the real line observed only through a small i.i.d. sample. A natural notion of central tendency for one-dimensional distributional data is the Wasserstein barycenter, whose quantile function is the pointwise average of the unit-level quantile functions. We focus on pointwise estimation of the Wasserstein barycenter quantile function: at a given quantile level, the target is the population mean of the corresponding unit-level quantiles. A naive plug-in estimator is the empirical Wasserstein barycenter, which treats observed unit-level empirical distributions as the true latent unit-level distributions. Under sparse sampling, however, this estimator can be severely biased. We propose an approach that avoids directly estimating either the unit-level distributions or the full population law of distributions. We start with the more ambitious goal of characterizing the distribution of latent unit-level quantiles at a given quantile level. We show that this distribution can be written in terms of the marginal distributions of the unit-level CDF values, which can be estimated using binomial mixture methods. This motivates our estimator, the marginal-constructed barycenter (MCB) estimator, obtained by taking the mean of the estimated distribution of latent unit-level quantiles. We establish conditions under which the MCB estimator is pointwise consistent and asymptotically normal, and show through simulations that it can substantially outperform the empirical Wasserstein barycenter under sparse sampling. We illustrate the method in an analysis of HIV-1 sequence data from the HVTN 502/503 vaccine efficacy trials, using the barycenter to summarize and compare within-participant distributions of viral sequence features when only a small number of sequences are available per participant.
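For reference, the naive plug-in estimator that the abstract uses as its baseline can be written in a few lines: compute each unit's empirical quantile at the chosen level and average across units. This is a sketch of that baseline only, not of the MCB estimator, whose binomial mixture step is not specified in the abstract; the function names and the inverse-CDF convention for the empirical quantile are assumptions.

```python
import numpy as np

def empirical_quantile(sample, level):
    """Empirical quantile as the generalized inverse of the empirical CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    k = max(int(np.ceil(level * x.size)) - 1, 0)
    return x[k]

def empirical_barycenter_quantile(samples_per_unit, level):
    """Plug-in barycenter quantile: average of unit-level empirical quantiles."""
    return float(np.mean([empirical_quantile(s, level) for s in samples_per_unit]))

# Three units, each observed through a handful of draws (sparse sampling).
units = [[0.2, 1.1, 0.7], [2.5, 3.0], [1.4, 0.9, 1.8, 2.2]]
print(empirical_barycenter_quantile(units, level=0.5))
```

With only two or three draws per unit, each unit-level empirical quantile is a poor proxy for the latent quantile, which is the source of the bias the abstract describes.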
An Information-Geometric Framework for Mapping Maximum Potential Biodiversity
Shinto Eguchi
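AI summary: Proposes an information-geometric framework that defines the potential composition and the diversity gap through a constrained variational principle, treats Hill-type diversity and Rao's quadratic entropy in a unified way, and provides benchmark comparisons for conservation.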
Biodiversity measures are often used descriptively: one computes a diversity index from an observed or estimated community composition and maps the resulting values across space. Conservation planning, however, also requires a site-specific benchmark against which the observed community can be compared. This chapter develops an information-geometric framework for such "potential diversity" and the associated "diversity gap". The central object is a pair of probability vectors on the species simplex: an observed or realized composition \(p^{\mathrm{obs}}\), and a potential composition \(p^{\mathrm{pot}}\) obtained by a constrained variational principle. The gap is then defined by comparing a diversity functional at these two compositions. The framework is developed for both Hill-type diversity, which measures abundance and evenness, and Rao's quadratic entropy, which incorporates trait, phylogenetic, or ecological dissimilarities among species. A spatial point-process interpretation clarifies how local ecological capacities can be defined before passing to the simplex. Escort constraints, capacity constraints, and divergence projections then provide a unified way to define nontrivial benchmarks beyond the uniform distribution. The resulting formulation separates two distinct questions: how diverse a community is, and how far it is from a locally admissible potential benchmark. It also connects the ecological idea of dark diversity with a continuous, abundance-weighted comparison on the probability simplex. We also outline a dynamic extension in which capacities, species migration, and climate-driven shifts vary over time. Empirical implementation with large-scale citizen-science biodiversity data and trait databases is left for future work.
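The two diversity functionals named above have standard closed forms, which the following sketch evaluates at an observed and a potential composition. The compositions and the dissimilarity matrix are made-up illustration values, and measuring the gap as a simple difference is an assumption: the abstract says only that the gap compares a diversity functional at the two compositions.

```python
import numpy as np

def hill_number(p, q):
    """Hill diversity of order q; q = 1 is the exponential of Shannon entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

def rao_quadratic_entropy(p, dissimilarity):
    """Rao's Q: expected dissimilarity between two randomly drawn individuals."""
    p = np.asarray(p, dtype=float)
    d = np.asarray(dissimilarity, dtype=float)
    return float(p @ d @ p)

# Hypothetical compositions on a four-species simplex.
p_obs = np.array([0.7, 0.2, 0.1, 0.0])
p_pot = np.array([0.25, 0.25, 0.25, 0.25])
d = 1.0 - np.eye(4)  # all species equally dissimilar (illustrative choice)

gap_hill = hill_number(p_pot, 2) - hill_number(p_obs, 2)  # assumed gap form
gap_rao = rao_quadratic_entropy(p_pot, d) - rao_quadratic_entropy(p_obs, d)
print(gap_hill, gap_rao)
```

Here the potential composition is uniform only for simplicity; the framework described above is aimed at benchmarks other than the uniform distribution.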
Correcting Variable Importance Scores Produced by Random Forests
Guancheng Zhou, Haiping Xu, Jason Liu, Donghui Yan
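AI summary: To address the effect of correlations among variables on random forest variable importance, proposes a correction that groups variables by conditional correlation; experiments show that both computationally efficient schemes correct variable importance effectively.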
Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables: variables that are correlated with many other variables tend to receive a lower importance index or to be completely masked (i.e., given an importance index near zero) by other strongly correlated variables. To prevent unwanted influence from correlated variables when calculating variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one groups variables individually and then separates the variable of interest from all correlated variables, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.
Conformal Prediction for Dyadic Regression under Complex Missingness Mechanisms
Robert Lunde, Minjie Yang, Elizaveta Levina, Ji Zhu
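AI summary: For dyadic regression under complex missingness mechanisms, proposes a conformal prediction framework that replaces exchangeability with distributional invariance conditions, handles samples that are random subsets via a bijection argument, and introduces several conformal procedures, including a graphon-weighted method, achieving asymptotic conditional validity.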
We develop a framework for conformal prediction in dyadic regression problems under complex missingness mechanisms. At the theoretical level, we establish super-uniformity of conformal prediction under distributional invariance conditions weaker than exchangeability. A key result handles the case where the sample itself is a random subset of the index set, a setting not covered by existing theory, via a novel bijection argument that constructs an explicit measure-preserving correspondence between events. In addition, we propose conformal prediction procedures for jointly exchangeable arrays, including full conformal, split conformal, a row-column approach exploiting similarities within rows and columns, and a selective conformal procedure achieving mask-conditional validity. For missing elements, we establish asymptotic validity of a graphon-weighted conformal procedure under a nonparametric graphon model for the missingness mechanism. We further establish conditional validity results for both continuous and discrete responses; to the best of our knowledge, this is the first formal proof of asymptotic conditional validity for weighted conformal prediction under a missing-not-at-random assumption. The proposed methods are illustrated on synthetic and real network data.
A Probabilistic Win Ratio Approach for Hierarchical Composite Endpoints with Coarsened Outcomes
Lei Li, Jing Lei, Yuexiao Dong
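AI summary: Proposes the probabilistic win ratio (PWR), which replaces deterministic comparisons with conditional probabilities to handle censored and missing data, improving efficiency and reducing bias; it reduces to the standard win ratio when outcomes are fully observed.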
The win ratio is increasingly used to analyze prioritized composite endpoints in clinical trials, but standard implementations rely on deterministic pairwise comparisons and can perform poorly in the presence of censoring and endpoint-specific missingness. In such settings, unresolved comparisons are often treated as ties, leading to loss of efficiency and potentially biased inference, particularly when lower-priority outcomes are incompletely observed. We propose the probabilistic win ratio (PWR), a framework for estimating the classical win ratio under coarsened observation. The PWR replaces deterministic pairwise decisions with conditional probabilities of win, loss, or tie given the observed data, allowing partially observed comparisons to contribute fractionally while being explicitly penalized according to their uncertainty. Comparisons with greater coarsening receive smaller effective weight, whereas fully observed comparisons contribute as in the classical analysis, preserving the clinical priority structure. When outcomes are fully observed, the PWR reduces exactly to the standard win ratio estimator. Simulation studies show that the PWR maintains low bias and mean squared error across a range of censoring and missingness scenarios. Two clinical trial case studies illustrate complementary data regimes, demonstrating calibration in near-complete data and stability under substantial right censoring.
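The deterministic pairwise comparison that the PWR generalizes can be sketched for the fully observed case: every treated patient is compared with every control patient on the highest-priority outcome first, and lower-priority outcomes are consulted only to break ties. This is an illustration of the classical all-pairs win ratio under the assumptions that larger values are better and that nothing is censored; it is not the PWR itself, and `compare_pair` and `win_ratio` are hypothetical names.

```python
def compare_pair(treated, control):
    """Return 1 if treated wins, -1 if control wins, 0 for a tie.

    Each argument is a tuple of outcomes ordered from highest to lowest
    priority; larger values are assumed to be better.
    """
    for t, c in zip(treated, control):
        if t > c:
            return 1
        if t < c:
            return -1
    return 0

def win_ratio(treated_group, control_group):
    """Classical win ratio: total wins divided by total losses over all pairs."""
    wins = losses = 0
    for t in treated_group:
        for c in control_group:
            result = compare_pair(t, c)
            wins += result == 1
            losses += result == -1
    return wins / losses  # undefined when there are no losses

# Outcomes: (survival time in months, symptom score), both fully observed.
treated = [(5, 1), (3, 2)]
control = [(3, 1), (4, 0)]
print(win_ratio(treated, control))
```

Under censoring, some of these comparisons cannot be resolved; the standard practice of counting them as ties is what the PWR replaces with conditional win, loss, and tie probabilities.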
Two-Sample Hypothesis Testing for Equality of Subspaces in Network Data
Rajdeep Brahma, Joshua Agterberg, Yuguo Chen
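AI summary: For the null hypothesis that two networks share the same subspace (for example, community structure), proposes a test statistic based on the Frobenius norm of the difference of projection matrices, proves asymptotic normality when the average expected degree grows logarithmically, and gives mean and variance estimators and local power.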
In many settings one is often interested in determining whether two networks share some joint structural connectivity patterns such as communities. However, while communities may be shared across networks, edge probabilities may differ significantly. Therefore, in this paper we consider testing a general null hypothesis that two networks have the same underlying subspace, which in particular includes the setting that communities are the same for either stochastic blockmodels or mixed-membership stochastic blockmodels (even if edge probabilities are different). We propose a test statistic based on the Frobenius norm of the difference of the leading subspace projection matrices, and we prove that our test statistic, after appropriate centering and scaling, converges in distribution to a Gaussian random variable as long as the average expected degree grows at least logarithmically in the number of vertices. We then provide estimators for the asymptotic mean and variance and show consistency under a stronger signal condition, and we give the local power of our test when the networks are sufficiently dense. Our theoretical results are based on a limit theorem for the projection difference of empirical and true eigenvectors which can also be viewed as the one-sample version of our test statistic, and this result may be of independent interest. We demonstrate our results through numerical simulations and an application to US Flight data.
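The raw quantity behind the test statistic, the Frobenius norm of the difference between leading-subspace projection matrices, is straightforward to compute. The sketch below shows only that uncentered, unscaled quantity; the centering and scaling constants are not given in the abstract and are omitted. Selecting leading eigenvectors by eigenvalue magnitude is an assumption, as are the function names.

```python
import numpy as np

def leading_projection(matrix, dim):
    """Projection onto the span of the dim eigenvectors with largest |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(matrix, dtype=float))
    top = np.argsort(np.abs(eigvals))[-dim:]
    u = eigvecs[:, top]
    return u @ u.T

def projection_distance(m1, m2, dim):
    """Frobenius norm of the difference of the two leading projections."""
    diff = leading_projection(m1, dim) - leading_projection(m2, dim)
    return float(np.linalg.norm(diff, ord="fro"))

# Two blockmodel probability matrices with the same communities
# but different edge probabilities (illustrative values).
z = np.repeat(np.eye(2), 3, axis=0)            # 6 nodes, 2 communities
p1 = z @ np.array([[0.8, 0.2], [0.2, 0.8]]) @ z.T
p2 = z @ np.array([[0.5, 0.1], [0.1, 0.3]]) @ z.T
print(projection_distance(p1, p2, dim=2))      # numerically zero
```

Because both matrices have column space spanned by the community indicators, the distance vanishes even though the edge probabilities differ, which is the situation the null hypothesis is meant to cover.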
HDSense: An Efficient Method for Ranking Observable Sensitivity
Benoît Assi, Christian Bierlich, Rikab Gambhir, Phil Ilten, Tony Menzo, Stephen Mrenna, Manuel Szewc, Michael K. Wilkinson, Jure Zupan
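AI summary: Proposes the HDSense score, which uses one-dimensional histograms to efficiently rank sets of observables by their power to constrain model parameters; it profiles over unknown correlations in the Fisher information framework, balances information content against redundancy, and is validated on parameter estimation for the Lund string fragmentation model.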
Identifying which observables most effectively constrain model parameters can be computationally prohibitive when considering full likelihoods of many correlated observables. This is especially important for, e.g., hadronization models, where high precision is required to interpret the results of collider experiments. We introduce the High-Dimensional Sensitivity (HDSense) score, a computationally efficient metric for ranking observable sets using only one-dimensional histograms. Derived by profiling over unknown correlations in the Fisher information framework, the score balances total information content against redundancy between observables. We apply HDSense to rank a set of observables in terms of their constraining power with respect to five parameters of the Lund string model of hadronization implemented in Pythia, using simulated leptonic collider events at the $Z$ pole. Validation against machine-learning-based full-likelihood approximations demonstrates that HDSense successfully identifies near-optimal observable subsets. The framework naturally handles data from multiple experiments with different acceptances and incorporates detector effects. While demonstrated on hadronization models, the methodology applies broadly to generic parameter estimation problems where correlations are unknown or difficult to model.
Robust Design-Based Estimation and Inference for Stratified Cluster-Randomized Trials with Unequal Cluster Sizes
Xinhe Wang, Ben B. Hansen
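AI summary: For stratified randomized trials with heterogeneous cluster sizes, shows that stratum-averaged estimators can be inconsistent, proposes the Hájek ratio estimator as a robust alternative, and develops a design-based variance estimator.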
Clustered randomized controlled trials are often stratified or pair-matched to improve covariate balance and efficiency. Sample average treatment effects (SATEs) are commonly estimated by averaging stratum-level treatment-control mean contrasts -- an approach that is natural and widely used. We show that, in stratified clustered trials with heterogeneous cluster sizes, such estimators need not be consistent for the SATE. They can converge to the wrong limit even under correct randomization and without model misspecification. The source is a covariance between cluster sizes and treatment effects: stratumwise averaging mis-weights clusters in a way that produces bias of constant order, regardless of sample size. We study the Hájek (ratio) estimator as a robust alternative. By aggregating outcomes within treatment groups before taking their difference, it remains consistent in clustered trials that grow by increasing strata sizes or the number of strata. Despite that, its use in design-based analyses of clustered trials has been limited by the lack of variance estimators. We develop a design-based variance estimator that applies to any number of strata of any size, and show that it is asymptotically conservative, a property that holds even when some strata contain only a single treated or control unit. We also present tests improving the coverage of Wald tests when the number of clusters is moderate. The framework extends naturally to covariate-adjusted estimators via a variance orthogonality property.
Robust Bayesian Predictive Model Selection Using Bregman Divergences
Jongwoo Choi, Neil A. Spencer, Dipak K. Dey
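AI summary: Because the log-score-based ELPD is sensitive to outliers and tail mismatch, proposes a generalized ELPD framework based on Bregman divergences that uses the β-divergence family to control the influence of low-density observations, giving robust model selection.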
Predictive Bayesian model comparison often relies on leave-one-out (LOO) cross-validation criteria such as the expected log predictive density (ELPD). However, model rankings can be overly sensitive to outliers and tail mismatch because ELPD is based on the log score. We propose a score-matched generalized ELPD framework that replaces the log score by a Bregman scoring rule to update model parameters through a generalized posterior and to evaluate LOO predictive utility. Candidate posterior predictive distributions are ranked by out-of-sample utility under the chosen scoring rule, yielding a direct proper-score generalization of standard ELPD. We focus especially on the $β$-divergence family, where $β$ controls the sensitivity of predictive comparison to low-density observations. Under model misspecification, the procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence. A simulation study and applications to microbial and forensic data show that the generalized ELPD can change the selected model through reduced sensitivity to low-density observations.
Nonparametric Riemannian Empirical Bayes and Denoising Measurements on Manifolds
Adam Quinn Jaffe, Leonardo V. Santoro, Bodhisattva Sen
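AI summary: For denoising when latent variables and measurements lie on a manifold, proposes a tangential Bayes denoiser based on a Tweedie-Eddington formula, builds a data-driven approximation using the Laplace-Beltrami operator, and shows that it nearly attains the Bayes risk in the low-noise regime, although convergence rates are slower than in the Euclidean case.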
We initiate the study of nonparametric empirical Bayes denoising methods in the setting where both the latent variables and their measurements lie on a compact Riemannian manifold, and where the likelihood is a Riemannian Gaussian distribution. Our starting point is a novel Tweedie-Eddington formula for Riemannian Gaussian mixture models which identifies a certain surrogate oracle denoiser in terms of the marginal distribution of the measurements; it avoids the explicit computation of the posterior Fréchet mean (as required by the Bayes denoiser) via a first-order approximation, hence we refer to it as the "tangential" Bayes denoiser. We show that this surrogate oracle achieves nearly the Bayes risk in a low-noise regime, we construct a fully data-driven approximation of it using the spectral theory of the Laplace-Beltrami operator, and we establish finite-sample rates of convergence for the distance between the surrogate oracle and its approximation. In contrast to the nearly parametric rates from the Euclidean setting, the rates in the Riemannian setting are slower due to the singularities of the Riemannian Gaussian density at the cut locus of its Fréchet mean; in the special case of the circle we establish matching lower bounds which show that our proposed denoiser is minimax-optimal, and that the denoising problem exhibits a genuinely nonparametric rate of convergence. Lastly, we implement our methodology in two scientific applications: in astronomy, the sphere-valued problem of denoising the locations of gamma ray bursts; in structural biology, the torus-valued problem of denoising pairs of torsion angles of adjacent amino acids in a protein (i.e., the Ramachandran plot).
Confidence, Statistical Evidence and Relative Belief, with Application to a Problem in Particle Physics
Michael Evans, Siqi Zheng
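AI summary: Uses relative belief inferences to construct uncertainty quantification intervals in a Poisson signal-plus-background model and contrasts them with Feldman-Cousins intervals; the inferences respect the likelihood ordering and meet frequentist requirements.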
Probability theory provides a clear definition of what is meant by evidence in favor, against or none either way, of an event occurring for an unobserved response, via the principle of evidence. This is immediately applicable when carrying out a proper Bayesian analysis. Even without a prior, this imposes restrictions on reported inferences as these need to reflect the likelihood ordering. Relative belief inferences satisfy this requirement and, when the errors in these inferences are controlled, they also satisfy repeated sampling, or frequentist, requirements such as achieving given confidence levels. Relative belief inferences are considered here for the construction of intervals for uncertainty quantification in the context of a Poisson model for a signal with background noise. These intervals are contrasted with the well-known Feldman-Cousins intervals for this problem.
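A relative belief ratio is the posterior probability of a value divided by its prior probability, so values above 1 indicate evidence in favor and values below 1 evidence against. For a Poisson count with known background the ratio on a discrete grid of signal values reduces to the likelihood divided by the prior predictive probability of the observed count. The sketch below illustrates this under assumed inputs (a uniform grid prior, count 5, background 3); it does not reproduce the paper's interval construction or its error control.

```python
import numpy as np
from math import lgamma

def relative_belief_ratios(count, background, signal_grid, prior):
    """Posterior / prior for each signal value on a grid.

    Model: count ~ Poisson(signal + background) with known background > 0.
    """
    s = np.asarray(signal_grid, dtype=float)
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()
    mean = s + background
    log_lik = count * np.log(mean) - mean - lgamma(count + 1)
    lik = np.exp(log_lik)
    marginal = np.sum(prior * lik)  # prior predictive probability of the count
    return lik / marginal

grid = np.arange(0.0, 10.5, 0.5)            # hypothetical signal grid
rb = relative_belief_ratios(5, 3.0, grid, np.ones_like(grid))
plausible = grid[rb > 1.0]                  # values with evidence in favor
print(plausible.min(), plausible.max())
```

Because the ratio is proportional to the likelihood, any region of the form {s : RB(s) > c} respects the likelihood ordering mentioned in the abstract.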
Towards the Mean: A Pragmatic Bayesian Workflow for Developing and Deploying Clinical Prediction Models
Mohsen Sadatsafavi, Richard D. Riley
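AI summary: Proposes a pragmatic Bayesian workflow for developing and deploying clinical prediction models; using shrinkage priors and decisions based on the individual posterior mean improves predictive performance and uncertainty quantification.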
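Clinical prediction models provide predictions (for example, estimated risks) for each individual, typically expressed as point estimates derived from a deterministic function such as a logistic regression equation. Such 'plug-in' predictions hide inherent uncertainty. In contrast, Bayesian methods provide a mechanism for uncertainty quantification based on individual-specific posterior risk distributions. However, Bayesian prediction models are little used because of perceived subjectivity, computational cost, and implementation complexity. To address this, we propose a pragmatic Bayesian workflow for producing and deploying prediction models. The main components are (i) shrinkage priors leading to posterior distributions of the regression coefficients based on a Laplace/normal approximation, which avoids Monte Carlo sampling; and (ii) using the individual's posterior mean for decision-making, which is supported from an expected-utility perspective. For (i), we suggest priors with complementary features (simplicity, user input, automatic shrinkage). For (ii), we suggest exact and approximate methods for computing the posterior mean, including quadrature, the MacKay approximation, and an adaptation of projection predictive mapping, which yields a simple logistic equation approximating the mean. Through examples and simulations, we show that the Bayesian workflow tends to match or exceed plug-in predictions in predictive performance while enabling uncertainty quantification with suitable coverage. In most simulations, predictions based on the posterior mean had higher clinical utility than plug-in predictions, sometimes substantially so. In conclusion, a Bayesian approach to clinical prediction modelling and deployment is both pragmatic and clinically advantageous, and is therefore highly recommended.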
Clinical prediction models provide predictions for individuals, typically expressed as point estimates derived from a deterministic function, such as a logistic regression equation. Such 'plug-in' predictions hide inherent uncertainty. In contrast, Bayesian methods offer a coherent mechanism for uncertainty propagation, and allow the computation of the posterior mean as the measure of centrality of choice for clinical decision-making. However, Bayesian methods are not widely utilised in predictive analytics for healthcare. We investigated the feasibility and performance of a Bayesian adaptation of the commonly used frequentist framework for risk prediction modelling. We assessed (i) the use of shrinkage priors with complementary features (simplicity, user input, and automatic shrinkage) that enable Laplace/normal approximation of the posterior, and (ii) exact and approximate methods for efficient computation of the posterior mean. Using examples and simulations, we demonstrate that this Bayesian approach is feasible and improves predictive performance, while enabling uncertainty quantification with suitable coverage. In small-to-medium sample sizes, the gain in clinical utility by using the posterior mean over plug-in predictions was equivalent to the gain from using a noticeably larger sample size. Adapting the widely used parametric regression methods to an approximate Bayesian framework for prediction modelling is both pragmatic and clinically advantageous.
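When the posterior of an individual's linear predictor is approximated by a normal distribution, the posterior mean risk is the expectation of a logistic function under that normal, which differs from the plug-in risk evaluated at the posterior mode. The sketch below compares two generic ways of computing that expectation: Gauss-Hermite quadrature, and the widely used probit-style closed form often attributed to MacKay. It assumes a scalar linear predictor with mean `mu` and standard deviation `sigma` (hypothetical inputs) and is not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_mean_risk_quadrature(mu, sigma, n_nodes=40):
    """E[sigmoid(Z)] for Z ~ N(mu, sigma^2) via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    values = sigmoid(mu + np.sqrt(2.0) * sigma * nodes)
    return float(np.sum(weights * values) / np.sqrt(np.pi))

def posterior_mean_risk_mackay(mu, sigma):
    """Closed-form approximation: shrink the linear predictor, then apply sigmoid."""
    return float(sigmoid(mu / np.sqrt(1.0 + np.pi * sigma ** 2 / 8.0)))

mu, sigma = 2.0, 1.0                     # illustrative posterior summary
plug_in = float(sigmoid(mu))
print(plug_in,
      posterior_mean_risk_quadrature(mu, sigma),
      posterior_mean_risk_mackay(mu, sigma))
```

For a positive linear predictor both posterior-mean computations fall below the plug-in risk, showing how accounting for posterior uncertainty pulls predictions toward 0.5.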
Empirical Stratification for Treatment Effect Heterogeneity with Respect to Post-Treatment Variables
Chao Cheng, Rui Wang, Yichi Zhang
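AI summary: Proposes an assumption-lean empirical stratification framework that defines empirical scores from potential post-treatment variable responses predicted from baseline covariates, constructs identifiable empirical-stratum treatment effects, and connects them to principal causal effects.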
Post-treatment variables (PVs), such as treatment noncompliance, behavioral responses, and intercurrent events, often modify the ultimate treatment effect on the primary outcome. However, existing methods provide limited tools for studying treatment effect heterogeneity with respect to PVs. Conventional heterogeneous treatment effect estimands condition on baseline covariates, but similarly conditioning on the observed PV can induce endogenous selection bias in treatment effect estimation. Principal stratification offers a rigorous framework for studying principal causal effects across principal strata, but principal strata are latent and their identification often requires stringent assumptions. This paper develops an assumption-lean empirical stratification framework for characterizing treatment effect heterogeneity with respect to PVs. We define empirical scores using the predicted potential PV responses based on baseline covariates, and use the empirical scores to construct empirically accessible subgroups. The resulting empirical-stratum treatment effects (ETEs) are identifiable under standard causal assumptions. We connect the proposed framework to principal stratification by showing that the average ETE recovers principal causal effects under the principal ignorability assumption, but remains informative under violations of this assumption. We further introduce projected ETE curves and develop efficient influence function-based estimators for semiparametric inference. We illustrate the proposed framework with two real-world applications.
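LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems
Xiaofeng Xiao, Jianhong Chen, Qiuzhuang Sun, Naichen Shi, Xubo Yue
AI summary: Proposes the LMT framework, which combines semantic signals extracted by large language models with Poisson-process-based temporal evidence to discover causal graphs from textual alarm records through a Bayesian approach, performing especially well in small-sample settings.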
Textual event records, such as alarm logs, have become an increasingly common data source in engineering and manufacturing systems. Beyond identifying correlations or recurring patterns, engineers are often interested in understanding which types of events causally trigger or influence other events during system operation. Textual event descriptions may contain semantic clues about such causal relationships, and recent large language models (LLMs) provide a promising tool for extracting these signals. However, relying solely on LLM-encoded textual information is insufficient for accurate causal discovery, since semantic patterns do not directly reveal causal mechanisms and may confuse causation with correlation or frequent sequential patterns. To address these challenges, we propose LMT, a Bayesian causal discovery framework for engineering event data that jointly leverages textual descriptions and timestamps. Specifically, LMT first uses LLMs to extract semantic causal signals from event descriptions and constructs a prior distribution over causal graphs among event types or event clusters. It then incorporates temporal evidence through a Poisson-process-based likelihood, allowing the LLM-informed prior to be refined by timestamp-based statistical evidence. By integrating the textual and temporal information, LMT produces a causal graph that is both interpretable and data-supported. Simulation studies show that the proposed framework is effective across different settings and is especially advantageous in small-sample alarm-event scenarios.
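Minimum Free Energy Randomized Design for Improved Covariate Balance
Haolin Chen, Jun Yu
AI summary: Proposes the minimum free energy randomized design, which trades off covariate balance against entropy maximization and, together with an efficient dynamic allocation algorithm, improves statistical efficiency and robustness.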
"Block what you can and randomize what you cannot" is the core principle for treatment effect estimation in randomized controlled trials. Although a wealth of allocation strategies has been developed, an explicit trade-off between the covariate balance achieved by blocking and the robustness guaranteed by randomization is seldom quantified. Motivated by the second law of thermodynamics, this work posits a new criterion that lowers the covariate imbalance while maximizing the entropy that quantifies contrast and allocation diversity. The resulting optimal strategy, termed the minimum free energy randomized design, is then derived, thereby formally achieving such a trade-off. To facilitate practical implementation, we further develop a computationally efficient dynamic allocation algorithm with theoretical guarantees. Using a finite-sample variance decomposition, the proposed randomization strategy is shown to control covariate imbalance while preventing unobserved heterogeneity from dominating the mean squared error, thus retaining minimax efficiency under the prescribed design constraints. Extensive numerical simulations demonstrate that our method achieves superior statistical efficiency and greater robustness than existing approaches.
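An Estimator-Robust Design: Augmenting Randomized Controlled Trials with External Real-World Data
Sky Qiu, Jens Tarp, Andrew Mertens, Mark van der Laan
AI summary: Proposes using adaptive targeted maximum likelihood estimation (A-TMLE) with a matching-based sampling strategy: the average treatment effect is decomposed into a pooled effect and a bias effect, and matching on the trial enrollment score and the propensity score in the external data improves estimator robustness and confidence interval coverage.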
Augmenting randomized controlled trials (RCTs) with external real-world data (RWD) has the potential to improve the finite sample efficiency of treatment effect estimators. We describe using adaptive targeted maximum likelihood estimation (A-TMLE) for estimating the average treatment effect (ATE) by decomposing the ATE estimand into two components: a pooled-ATE estimand that combines data from both the RCT and external sources, and a bias estimand that captures the conditional effect of RCT enrollment on the outcome. This approach views the RCT data as the reference and corrects for inconsistencies of any kind between the RCT and the external data source. Given the growing abundance of external RWD from modern electronic health records, determining the optimal strategy to select candidate external patients for data integration remains an open yet critical problem. In this work, we begin by studying the robustness property of the A-TMLE estimator and then propose a matching-based sampling strategy that attempts to improve the robustness of the estimator with respect to the target estimand. Our proposed strategy is outcome-blind and involves matching based on two one-dimensional scores: the trial enrollment score and the propensity score in the external data. We demonstrate in simulations that our sampling strategy improves the coverage and narrows the widths of confidence intervals produced by A-TMLE. We illustrate our method with a case study of augmenting the DEVOTE cardiovascular safety trial by using the Optum Clinformatics claims database.
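A Data Compression Method for Fast Dimension Reduction and Clustering of High-Dimensional Discrete Data
Silvia D'Angelo, Michael Fop
AI summary: Proposes a deterministic dimension-reduction framework that compresses high-dimensional discrete observations into low-dimensional continuous representations through weighted sums defined by a scaled positional encoding, guaranteeing injectivity, approximate Gaussianity, and preserved separation of cluster centroids; it is computationally efficient and applies to several data types.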
High-dimensional discrete data arise in many contemporary applications, including genomics, microbiome research, survey studies, and digital behavioral analysis. Clustering such data remains challenging because existing methods are often computationally demanding, sensitive to sparsity and discreteness, or designed for specific data types. We propose a deterministic dimension-reduction framework for clustering high-dimensional discrete observations. The method compresses each observation into a low-dimensional continuous representation through weighted sums defined by a scaled positional encoding, yielding a numerically stable transformation applicable to binary, categorical, and count-valued data. We establish several theoretical properties of the proposed compression. The mapping is injective, ensuring that distinct observations remain distinct after compression. Under mild regularity conditions, the compressed variables admit an approximate Gaussian representation, providing a theoretical basis for model-based clustering in the compressed space. We further show that separation between cluster centroids is preserved under compression, implying that location-driven cluster structure remains identifiable after dimension reduction. Extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings. The proposed approach is also computationally efficient, providing substantial speed improvements over dimension-reduction techniques commonly used in conjunction with clustering. Applications to Irish baby-name records and microbiome data further illustrate its practical utility. The proposed framework offers a scalable, computationally efficient, and broadly applicable approach to clustering high-dimensional discrete data.
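Distributionally Robust PCA with Data-Adaptive Wasserstein Geometry
Chuang Xu, Andrew T. A. Wood, Yanrong Yang
AI summary: Proposes distributionally robust PCA that minimizes worst-case reconstruction risk over a data-adaptive Wasserstein neighborhood, derives the dual problem, introduces a tractable surrogate objective, and proves consistency and local Grassmannian asymptotics of the estimators.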
We develop a distributionally robust formulation of principal component analysis that minimizes worst-case reconstruction risk over distributions lying within a Wasserstein neighborhood of the empirical measure. The Wasserstein neighborhood, viewed as an ambiguity set of distributions, is adaptively calibrated through a transport matrix $G$ to capture heterogeneous uncertainty across dimensions. The homogeneous case, in which $G$ is a scalar multiple of the identity matrix, recovers classical PCA. Under a general transport matrix $G$, we derive a dual characterization of the associated minimax optimization problem and introduce a tractable surrogate objective function consisting of the square-root empirical reconstruction error plus a geometry-dependent residual exposure penalty. The exact and surrogate estimators are shown to be consistent for the population PCA subspace and asymptotically equivalent at the projector level. The transport geometry is allowed to be data adaptive, while the Wasserstein radius is calibrated via robust Wasserstein profile inference, yielding a data-driven radius of order $n^{-1/2}$. Comprehensive theoretical guarantees are established, including consistency and local Grassmannian asymptotics exhibiting an explicit Wasserstein-induced drift determined by the limiting transport geometry and calibration level. Numerical experiments and a real-data application demonstrate that the proposed method can substantially improve finite-sample out-of-sample performance under structured covariance shifts, moderate contamination, and certain same-distribution regimes.
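Convex Shrinkage with Symmetry for High-Dimensional Covariance Estimation
Mitchell A. Thornton
AI summary: Proposes data-adaptive shrinkage estimators for high-dimensional covariance estimation that use the Reynolds projection under a finite symmetry group as the shrinkage target, combining structured targets with adaptive convex shrinkage to improve estimation accuracy.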
We develop a class of data-adaptive shrinkage estimators for high-dimensional covariance estimation in which the shrinkage target is a Reynolds projection of the sample covariance under a finite symmetry group selected from a candidate library by held-out predictive performance. The class generalizes the convex shrinkage estimator of Ledoit and Wolf by replacing the scalar-identity target with a structured target derived from a symmetry group when one is available, and generalizes the group-symmetric maximum-likelihood estimator of Shah and Chandrasekaran by combining structural targeting with adaptive convex shrinkage and by selecting the group from data rather than treating it as prespecified. A two-tier procedure performs the group selection: a universal per-candidate evaluation based on held-out negative log-likelihood, optionally preceded by a domain-specific step that constructs the candidate library from structural priors. We establish a finite-sample regret bound for the held-out calibration of the convex combination weight, an oracle inequality for the data-driven group selection, and a quantitative sufficient-match condition under which the proposed estimator dominates Ledoit-Wolf shrinkage in Frobenius mean-squared error. The procedure is illustrated on six real-data problems spanning finance (S&P 500 daily returns), climate (NOAA OISST sea-surface temperature anomalies), genomics (TCGA-BRCA gene expression), radio signal processing (RadioML 2018.A), astronomical imaging (Galaxy10 DECaLS), and natural image patches (CIFAR-10 with a CIFAR-10.1 distribution-shift companion). An empirical comparison is also made against the Bayesian permutation-symmetry estimator of Chojecki and colleagues. Outside the few-shot regime, where structural priors carry the most information per observation, Ledoit-Wolf shrinkage remains the appropriate baseline.
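As a rough illustration of the idea in this abstract, the sketch below averages a sample covariance over a finite permutation group (a Reynolds projection) and picks the convex shrinkage weight by held-out Gaussian negative log-likelihood. The cyclic-shift group, the grid of weights, and all function names are assumptions made for illustration; the paper's candidate library, calibration procedure, and group selection step are not reproduced here.

```python
import numpy as np

def cyclic_group(p):
    """All p cyclic shifts of the coordinates 0..p-1, as index arrays (assumed group)."""
    return [np.roll(np.arange(p), k) for k in range(p)]

def reynolds_projection(S, perms):
    """Average P S P^T over a finite permutation group given as index arrays."""
    acc = np.zeros_like(S, dtype=float)
    for perm in perms:
        acc += S[np.ix_(perm, perm)]  # permute rows and columns together
    return acc / len(perms)

def gaussian_nll(Sigma, S_test):
    """Gaussian negative log-likelihood up to constants; inf if det is not positive."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return np.inf
    return logdet + np.trace(np.linalg.solve(Sigma, S_test))

def shrink_to_symmetric_target(X_train, X_test, perms, alphas):
    """Pick the convex weight alpha by held-out NLL; return (alpha, estimate)."""
    S_train = np.cov(X_train, rowvar=False)
    S_test = np.cov(X_test, rowvar=False)
    target = reynolds_projection(S_train, perms)
    scores = [gaussian_nll((1 - a) * S_train + a * target, S_test) for a in alphas]
    best = alphas[int(np.argmin(scores))]
    return best, (1 - best) * S_train + best * target
```

With the cyclic group the projected target is a circulant matrix, so it has far fewer free parameters than the raw sample covariance; whether that structure suits a given data set is exactly what the held-out score is meant to decide.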
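Spatial Prediction of Local Soil Erosion Distributions in Wasserstein Space
Jiaming Qiu, Xiongtao Dai, Zhengyuan Zhu, Shuiqing Yin
AI summary: Proposes a new method that treats local erosion distributions as objects in the Wasserstein space, models them through basis expansion and a multivariate random field, and performs spatial prediction with local regression and Kriging; it outperforms existing methods in simulations and on real data from Shaanxi province.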
Obtaining precise erosion measurements requires costly fieldwork, making it infeasible to directly survey large domains such as a province or river basin. To extend fieldwork results across such extensive domains, we propose a novel spatial prediction method that treats local erosion distributions as objects in the Wasserstein space. These distributions are mapped into square-integrable trajectories and represented via basis expansion, forming a multivariate random field that captures spatial dependence. By applying local regression and Kriging in this representation, our approach flexibly models and predicts erosion distributions at arbitrary locations. This framework improves prediction for functionals of the distribution, such as the mean and exceedance probabilities. Simulation studies demonstrate that the proposed method outperforms a misspecified parametric alternative and existing Fréchet regression approaches. We illustrate the approach with a detailed erosion analysis in Shaanxi province, China, where local measurements from surveyed watersheds are extended to predict erosion distributions across the entire province using covariates such as land use and elevation.
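For readers unfamiliar with one-dimensional Wasserstein geometry, the sketch below shows why distributions can be handled as square-integrable curves: for distributions on the real line, the 2-Wasserstein distance equals the L2 distance between quantile functions, and averaging quantile functions gives a Wasserstein barycenter. Representing each distribution by its quantile function is one standard choice and an assumption here; the abstract does not state which map the authors use, and the function names are illustrative.

```python
import numpy as np

def quantile_curve(sample, n_grid=200):
    """Empirical quantile function on a uniform midpoint grid of (0, 1)."""
    grid = (np.arange(n_grid) + 0.5) / n_grid
    return np.quantile(np.asarray(sample, dtype=float), grid)

def w2_distance(q1, q2):
    """2-Wasserstein distance, approximated by the L2 distance of quantile curves."""
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

def barycenter_curve(curves, weights=None):
    """Weighted average of quantile curves, i.e. a 1-D Wasserstein barycenter."""
    return np.average(np.asarray(curves), axis=0, weights=weights)
```

Once each location's distribution is a curve of this kind, ordinary tools for vector-valued or functional data (basis expansion, local regression, Kriging of coefficients) can be applied, and a predicted curve can be mapped back to a distribution as long as it stays non-decreasing.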
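A Stochastic Weather Generator for High-Frequency Wind Vector Time Series
Mingshi Cui, Kevin Eng, Justin T. Greene, Zern Ke, Abolfazl Sodagartojgi, Zhiqiu Xia, Gemma E. Moran, Michael L. Stein
AI summary: For minute-scale wind vector time series, develops machine learning models based on time vector-quantized variational autoencoders that generate realistic series and capture diurnal variation, but match the distribution of extreme wind speeds poorly.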
Surface winds can vary substantially from one minute to the next, so there is scope for studying their variation on this fine time scale. Restricting to the month of June to minimize seasonality, this work develops a range of machine learning models for generating realistic time series of surface wind vectors at a site in Lamont, Oklahoma based on more than 30 years of high-quality measurements at the minute time scale. Such a generator could be used as an input into models from a range of disciplines, notably for wind energy, but also wildfire spread and aviation, among others. The data show complex diurnal structures in both wind speed and direction that would be challenging to capture with standard time series models, so we consider a number of machine learning approaches to producing a stochastic wind generator based on time vector-quantized variational autoencoders. We consider generating a day's worth of data at a time and generating a day of wind vectors conditional on the previous day's winds. We also study methods for incorporating a discrete weather state variable in the generator. We evaluate the generators using a wide range of formal and informal methods. The best of these generators can capture many but not all of the complex features present in the observational data. In particular, the best of our approaches accurately mimic diurnal changes in wind volatility but struggle to match the observed distribution of extreme wind speeds.
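Leave-a-Window-Out: Modifying the Jackknife for Predictive Inference in Time Series
Hanyang Jiang, Rina Foygel Barber, Ashwin Pananjady, Yao Xie
AI summary: To address non-exchangeable data and memory-based predictors in time series, proposes the leave-a-window-out (LWO) method, a modification of the jackknife that achieves valid coverage and produces narrower intervals than split conformal prediction.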
Conformal prediction methods enjoy strong theoretical and empirical predictive inference performance, provided the data is exchangeable and is treated symmetrically during training. However, these assumptions are impractical in many settings, such as time series, where temporal dependence violates exchangeability and it is preferable to use predictors that leverage dependence by treating data asymmetrically. Recent work shows that split conformal prediction is robust to these issues, but sample splitting can reduce accuracy, motivating the study of methods that do not rely on data splitting in the time series setting. In this work, we show that the vanilla leave-one-out jackknife can suffer arbitrary loss of coverage even in canonical time series models with mild temporal dependence. As a remedy, we propose a modification tailored to such settings, which we term the leave-a-window-out (LWO) method, and show that it can achieve valid coverage provided that the model-fitting procedure satisfies mild stability properties. Our proofs are based on quantifying the degree to which the data departs from cyclic exchangeability, which we measure with newly introduced coefficients. Experiments on time series demonstrate that our method often enjoys valid coverage when the vanilla jackknife fails to cover, while producing much narrower intervals than split conformal prediction.
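The sketch below is one plausible reading of the leave-a-window-out idea, not the authors' algorithm: for each time index, the model is refit without a window of neighbouring observations, the held-out residual at that index is recorded, and a jackknife-style interval is formed from a quantile of those residuals. The least-squares predictor, the symmetric window, and the quantile rule are all assumptions; setting `half_width=0` gives the ordinary leave-one-out jackknife.

```python
import numpy as np

def lwo_residuals(X, y, half_width):
    """Absolute residual at each i from a fit that omits indices within half_width of i."""
    n = len(y)
    res = np.empty(n)
    for i in range(n):
        keep = np.abs(np.arange(n) - i) > half_width  # drop the window around i
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        res[i] = abs(y[i] - X[i] @ coef)
    return res

def lwo_interval(X, y, x_new, half_width, alpha=0.1):
    """Full-data prediction plus/minus a (1 - alpha) quantile of the LWO residuals."""
    res = np.sort(lwo_residuals(X, y, half_width))
    n = len(y)
    k = min(n, int(np.ceil((1 - alpha) * (n + 1))))  # finite-sample quantile index
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    center = float(x_new @ coef)
    return center - res[k - 1], center + res[k - 1]
```

Removing a window rather than a single point keeps strongly dependent neighbours out of the refit, so the recorded residual behaves more like a genuine out-of-sample error; how wide the window should be depends on the dependence in the data and is not addressed by this sketch.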
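Wishart Kernel Density Estimation for Strongly Mixing Time Series on the Cone of Positive Definite Matrices
Léo R. Belzile, Christian Genest, Frédéric Ouimet, Donald Richards
AI summary: Proposes a Wishart kernel density estimator for density estimation on the cone of positive definite matrices; the estimator is boundary-aware and mitigates boundary bias, its mean squared error, uniform strong consistency, and asymptotic normality are established under mixing conditions, and simulations and an example show that it outperforms other methods.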
A Wishart kernel density estimator (KDE) is introduced for density estimation in the cone of positive definite matrices. The estimator is boundary-aware and mitigates the boundary bias suffered by conventional KDEs, while remaining simple to implement. Its mean squared error, uniform strong consistency on expanding compact sets, and asymptotic normality are established under the Lebesgue measure and suitable mixing conditions. This work represents the first study of density estimation for dependent data on this space under any metric. For independent observations, an asymptotic upper bound on the mean absolute error is also derived. A simulation study compares the performance of the Wishart KDE with that of the log-Gaussian KDE, another boundary-aware estimator based on the matrix-variate lognormal distribution proposed by Schwartzman [Int. Stat. Rev., 2016, 84(3), 456--486], and with the naive Gaussian KDE on the ambient Euclidean space. When estimating the stationary marginal density of a Wishart autoregressive process for several autoregressive coefficient matrices and innovation covariance matrices, the Wishart KDE exhibits the best overall accuracy and stability. The practical utility of the Wishart KDE is illustrated by estimating the marginal density of a one-year time series of realized covariance matrices computed from 5-minute intra-day returns on Amazon Corp. shares and on the Standard & Poor's 500 exchange-traded fund. All code is publicly available via the R package ksm to facilitate implementation of the method and reproducibility of the findings.
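A Flexible Approach to Augmenting Bayesian VARs with Nonlinear Factors
Todd Clark, Florian Huber, Gary Koop
AI summary: Proposes a vector autoregression with nonlinear factors modeled nonparametrically by regression trees; the factor approach models nonlinearities parsimoniously, reduces the risk of misspecification, permits efficient Bayesian computation, and supports identification of structural shocks.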
This paper proposes a vector autoregression augmented with nonlinear factors that are modeled nonparametrically using regression trees. There are four main advantages of our model. First, the use of factor methods ensures that departures from linearity are modeled parsimoniously. In particular, they exhibit functional pooling where a small number of nonlinear factors are used to model common nonlinearities across variables. Second, modeling potential nonlinearities nonparametrically lessens the risk of misspecification. Third, Bayesian computation using MCMC is straightforward even in very high-dimensional models, allowing for efficient, equation-by-equation estimation, thus avoiding computational bottlenecks that arise in popular alternatives such as the time-varying parameter VAR. Fourth, existing methods for identifying structural economic shocks in linear factor models can be adapted for the nonlinear case in a straightforward fashion using our model. Exercises involving artificial and macroeconomic data illustrate the properties of our model and its usefulness for forecasting and structural economic analysis.
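Interpretable Deep Convolutional Models for Nonlinear Multivariate Time Series in Complex Systems
Domjan Baric, Davor Horvatic
AI summary: Proposes the DCIts architecture, which factorizes into Focuser and Modeler components to learn locally interpretable interaction structure in nonlinear multivariate time series, recovering stable signed lag-resolved interaction patterns while maintaining forecasting accuracy.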
We introduce the Deep Convolutional Interpreter for Time Series (DCIts), a deep-learning architecture for nonlinear multivariate time series that provides sample-specific, locally interpretable descriptions of the underlying interaction structure. Unlike standard black-box forecasters, DCIts learns a time- and lag-dependent transition tensor explicitly factorized into two components: a Focuser, which selects relevant source series and time lags via a sparse masking mechanism, and a Modeler, which assigns signed coefficients to these selected interactions. This decomposition yields a local lag-adjacency structure and signed source-lag contributions for every forecast instance, enabling direct inspection of effective connectivity; when higher-order branches are activated, the same framework yields order-resolved elementwise polynomial contributions. Architecturally, DCIts uses a diverse bank of convolutional filters to capture temporal and cross-variable dependencies, which are mapped through a bottleneck network to the transition tensor. On controlled benchmark datasets with a known interaction structure, we demonstrate that DCIts achieves competitive forecasting error relative to a strong interpretable baseline while recovering stable, signed, lag-resolved interaction patterns. The framework thus prioritizes intrinsic interpretability, using forecasting accuracy as a faithfulness constraint rather than the sole objective.
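Multiple Change-Point Detection for Poisson Point Processes
C. Dion-Blanc, D. Hawat, E. Lebarbier, S. Robin
AI summary: For data from inhomogeneous or marked Poisson processes, proposes an offline multiple change-point detection method based on minimum contrast estimation, selects the number of change-points by cross-validation, and extends the method to self-exciting processes.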
The aim of change-point detection is to identify behavioral shifts within time series data. This article focuses on scenarios where the data is derived from an inhomogeneous Poisson process or a marked Poisson process. We present a methodology for detecting multiple offline change-points using a minimum contrast estimator. Specifically, we address how to manage the continuous nature of the process given the available discrete observations. Additionally, we select the appropriate number of changes via a cross-validation procedure which is particularly effective given the characteristics of the Poisson process. Lastly, we show how to use this methodology for self-exciting processes with changes in the intensity. Through experiments, with both simulated and real datasets, we showcase the advantages of the proposed method, which has been implemented in an R package.
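A minimal sketch of offline segmentation for a Poisson process with piecewise-constant intensity may help make the minimum-contrast idea concrete. It assumes the contrast is the negative maximized Poisson log-likelihood of each segment, that change-points are restricted to a user-supplied grid of candidate times, and that the number of segments is given; the paper's actual contrast, its treatment of continuous time, and its cross-validation step are not reproduced. Function names are illustrative.

```python
import numpy as np

def segment_cost(n_events, length):
    """Negative maximized Poisson log-likelihood of one constant-rate segment."""
    if n_events == 0:
        return 0.0
    return -(n_events * np.log(n_events / length) - n_events)

def detect_changes(events, grid, n_segments):
    """Best split of [grid[0], grid[-1]] into n_segments, boundaries taken from grid."""
    events = np.sort(np.asarray(events, dtype=float))
    grid = np.asarray(grid, dtype=float)  # assumed strictly increasing
    m = len(grid) - 1
    cum = np.searchsorted(events, grid, side="left")  # events before each grid point
    cum[-1] = len(events)                             # include events at the right end
    best = np.full((n_segments + 1, m + 1), np.inf)
    prev = np.zeros((n_segments + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    # Dynamic programming: best[k, j] is the minimal cost of k segments covering grid[0..j].
    for k in range(1, n_segments + 1):
        for j in range(k, m + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + segment_cost(cum[j] - cum[i], grid[j] - grid[i])
                if c < best[k, j]:
                    best[k, j], prev[k, j] = c, i
    # Backtrack the chain of segment start indices from right to left.
    starts, j = [], m
    for k in range(n_segments, 0, -1):
        j = prev[k, j]
        starts.append(j)
    # Drop the first start, which is grid[0] itself, and return the change-point times.
    return grid[np.array(sorted(starts)[1:], dtype=int)]
```

The triple loop costs on the order of `n_segments * m**2` contrast evaluations for a grid of `m` cells, which is fine for a sketch but would need pruning or a coarser grid for long records.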
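Forecasting Partially Observed Time Series via Nonnegative Matrix Factorization
Yohann de Castro, Luca Mencarelli
AI summary: Proposes the Sliding Mask Method (SMM) combined with nonnegative matrix completion for forecasting multiple nonnegative time series, implemented through mask Archetypal Matrix Factorization (mAMF) and mask normalized Nonnegative Matrix Factorization (mNMF); theory shows the recovery error is proportional to the noise, and experiments show it outperforms Transformers, LSTM, and other methods.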
In modern time series problems, one aims to forecast multiple time series with possibly missing and noisy values. In this paper, we introduce the Sliding Mask Method (SMM) for forecasting multiple nonnegative time series by means of nonnegative matrix completion: observed noisy values and forecast/missing values are collected into matrix form, and learning is achieved by representing its rows as convex combinations of a small number of nonnegative vectors, referred to as the archetypes. We introduce two estimators, the mask Archetypal Matrix Factorization (mAMF) and the mask normalized Nonnegative Matrix Factorization (mNMF), which can be combined with SMM. We prove that these estimators recover the true archetypes with an error proportional to the noise. We use a proximal alternating linearized method (PALM) to compute the archetypes and the convex combination weights. We compare our estimators with state-of-the-art methods (Transformers, LSTM, SARIMAX...) for multiple time series forecasting on real data and find that our method outperforms them in most of the experiments.
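Nonlinear Estimators: Dual Bayesian Affine Estimators for Parameter Learning
Sasan Vakili, Daniël Woonings, Pradyumna Paruchuri, Peyman Mohajerin Esfahani
AI summary: Proposes a nonlinear parameter estimator for Wiener-type state-space models in which a fixed-point architecture couples two affine minimum mean-squared error estimators, one for the unknown parameters and one for the latent variables; two dual-estimator frameworks are developed, and experiments show that the dual state-parameter estimator attains lower parameter mean-squared error than the other methods.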
This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between estimating each component using the updated prior of the other, obtained from that component's plug-in estimate statistics from the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments, showing that the dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.
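Deterministic Denominator Design for Locally Tamed Stochastic Gradient Langevin Dynamics
Yiwei Zhou, Ziheng Chen
AI summary: For the denominator design problem in tamed stochastic-gradient Langevin dynamics, proposes deterministic denominators built from proxy scores and quantiles, which avoid the mean shift of random denominators; experiments show performance close to the oracle case.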
Tamed stochastic-gradient Langevin dynamics (SGLD) stabilizes large drifts by adding a denominator to the update. If this denominator uses the same stochastic-gradient sample as the update step, it can also change the conditional mean drift. We study deterministic denominators: the state-dependent envelope is fixed before the current oracle sample is drawn. The main question is how to design this envelope in practice. The design starts from an oracle score, builds a low-cost proxy score on pilot states, chooses activation thresholds by empirical quantiles, and then applies a small calibration layer. The analysis tracks three steps: proxy and threshold errors become envelope errors; envelope errors perturb one SGLD step; and the local residuals give stationary errors through a conditional perturbation bridge. Experiments show that the proxy-quantile denominators are close to oracle-score behavior, avoid the random-denominator mean-shift channel, and improve simple deterministic taming choices.
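The mean-shift effect described in this abstract can be seen in a small worked example. The sketch below assumes a common taming form, dividing the drift by `1 + sqrt(step) * norm`, and compares a denominator computed from the sampled gradient with one fixed by the state alone. The toy two-point gradient distribution and the constant `envelope` function are hypothetical stand-ins; the paper's proxy scores, quantile thresholds, and calibration layer are not modelled.

```python
import numpy as np

def random_denominator(grad_sample, step):
    """Taming factor computed from the same stochastic gradient used in the update."""
    return 1.0 + np.sqrt(step) * np.linalg.norm(np.atleast_1d(grad_sample))

def deterministic_denominator(theta, step, envelope):
    """Taming factor fixed by the state alone, before any gradient sample is drawn."""
    return 1.0 + np.sqrt(step) * envelope(theta)

def tamed_sgld_step(theta, grad_sample, denom, step, beta, rng):
    """One tamed SGLD update with inverse temperature beta."""
    noise = np.sqrt(2.0 * step / beta) * rng.standard_normal(np.shape(theta))
    return theta - step * grad_sample / denom + noise

def mean_tamed_drift(grad_values, probs, denom_fn):
    """Exact conditional mean of grad / denominator for a discrete gradient law."""
    return sum(p * g / denom_fn(g) for g, p in zip(grad_values, probs))

# Toy 1-D gradient oracle at a fixed state: the mean gradient is +1, with a heavy right tail.
grad_values, probs, step = [-1.0, 7.0], [0.75, 0.25], 1.0
envelope = lambda theta: 1.0  # hypothetical state-dependent envelope (constant here)
drift_random = mean_tamed_drift(
    grad_values, probs, lambda g: random_denominator(g, step))
drift_fixed = mean_tamed_drift(
    grad_values, probs, lambda g: deterministic_denominator(0.0, step, envelope))
```

In this toy case the random denominator damps the large positive sample far more than the small negative one, so the conditional mean of the tamed drift becomes negative even though the mean gradient is positive, while the fixed denominator only rescales the mean gradient and keeps its sign.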
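Subsurface Flow Data Assimilation with Latent Diffusion Model Parameterization: Performance of Ensemble Kalman and Monte Carlo Techniques
Guido Di Federico, Wenchao Teng, Louis J. Durlofsky
AI summary: For high-dimensional parameter inversion in subsurface flow data assimilation, compares an ensemble Kalman method (ESMDA) with Monte Carlo methods (MCMC/SMC), all based on latent diffusion models (LDMs), on 3D channelized geomodels, and finds that the Monte Carlo methods reduce data mismatch and uncertainty more effectively while preserving geological realism.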
Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space. To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.
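For readers unfamiliar with Kalman-gain-based ensemble updates, the following is a minimal NumPy sketch of one ES-MDA step in its standard textbook form, run on a toy linear forward model. The matrix `G`, the ensemble size, and the inflation schedule are illustrative assumptions; nothing here involves the LDM parameterization or the flow simulator used in the paper.

```python
import numpy as np

def esmda_update(M, D, d_obs, C_e, alpha, rng):
    """One ES-MDA step. M: (n_param, n_ens) parameters, D: (n_obs, n_ens) predictions."""
    n_ens = M.shape[1]
    dM = M - M.mean(axis=1, keepdims=True)
    dD = D - D.mean(axis=1, keepdims=True)
    C_md = dM @ dD.T / (n_ens - 1)        # parameter-data cross-covariance
    C_dd = dD @ dD.T / (n_ens - 1)        # predicted-data covariance
    # Perturb the observations with the inflated noise covariance alpha * C_e.
    E = rng.multivariate_normal(np.zeros(len(d_obs)), alpha * C_e, size=n_ens).T
    K = C_md @ np.linalg.inv(C_dd + alpha * C_e)
    return M + K @ (d_obs[:, None] + E - D)

rng = np.random.default_rng(0)
G = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 2.0]])   # toy linear forward model
d_obs = G @ np.array([1.0, -1.0, 0.5])
C_e = 0.01 * np.eye(2)
M_prior = rng.standard_normal((3, 200))
M = M_prior.copy()
for alpha in [4.0, 4.0, 4.0, 4.0]:                  # inverse inflations sum to 1
    M = esmda_update(M, G @ M, d_obs, C_e, alpha, rng)

mismatch_prior = np.linalg.norm(G @ M_prior.mean(axis=1) - d_obs)
mismatch_post = np.linalg.norm(G @ M.mean(axis=1) - d_obs)
```

In this linear Gaussian toy the update is well behaved; the abstract's point is that a highly nonlinear map between latent variables and data can degrade exactly this kind of covariance-based update.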
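Rare Event Analysis via Stochastic Optimal Control
Yuanqi Du, Jiajun He, Dinghuai Zhang, Eric Vanden-Eijnden, Carles Domingo-Enrich
AI summary: Casts estimation of the committor function in rare event analysis as a stochastic optimal control problem, uses feedback control to steer trajectory sampling, and develops two loss functions and a method for handling metastability, obtaining more accurate results on benchmark systems.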
Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object--the committor function, which gives the probability that the system will next reach the product rather than the reactant--encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control--proportional to the gradient of its logarithm--that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.
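A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms
Gil Goldshlager, Jiang Hu, Lin Lin
AI summary: By viewing subsampled natural gradient descent (SNG) as a sketch-and-project method, proposes a new proxy based on squared volume sampling, proves that with a single mini-batch the expected SNG direction equals a preconditioned gradient descent step, gives global convergence guarantees and explicit convergence rates, and explains the advantage of SNG over SGD as more effective exploitation of spectral decay in the model Jacobian.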
Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
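As background on the sketch-and-project viewpoint, here is a minimal NumPy sketch of the generic iteration for a consistent linear system with row-subsampling sketches (a block Kaczmarz method). It is not the SNG algorithm or the squared-volume-sampling proxy from the paper; the uniform row sampling, problem sizes, and names are assumptions for illustration.

```python
import numpy as np

def sketch_and_project(A, b, block_size, n_iters, rng):
    """Repeatedly project the iterate onto the solution set of a random row block."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iters):
        rows = rng.choice(m, size=block_size, replace=False)  # the sketch
        A_s = A[rows]
        r_s = A_s @ x - b[rows]                                # sketched residual
        # Minimum-norm correction, i.e. projection onto {z : A_s z = b_s}.
        x = x - np.linalg.lstsq(A_s, r_s, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
x_hat = sketch_and_project(A, b, block_size=3, n_iters=300, rng=rng)
error = np.linalg.norm(x_hat - x_star)
```

Each step uses only a small block of rows, which loosely mirrors how a mini-batch restricts the information available to a subsampled natural gradient step.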
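The Wasserstein Geometry of Information Loss in Nonlinear Dynamical Systems
Yiting Duan, Zhikun Zhang, Yi Guo
AI summary: For the multi-valued evolution caused by non-injective time-delay reconstruction maps of nonlinear systems, proposes a measure-theoretic framework to quantify ambiguity, introduces an intrinsic stochasticity measure, and computes it numerically at finite resolution with a k-nearest-neighbor estimator.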
Time-delay embedding is a powerful technique for reconstructing the dynamics of nonlinear systems. However, the reconstruction map is not always an embedding, a condition rarely verified in practice. When the reconstruction map is non-injective, multiple latent states may map to the same reconstructed state, leading to multi-valued $n$-step evolution. Consequently, the induced system no longer admits a deterministic closure, and the dispersion of future trajectories leads to ambiguity. In this work, we establish a measure-theoretic framework to quantify the ambiguity induced by multi-valued evolution and introduce intrinsic stochasticity to quantify the ambiguity over a finite horizon. For numerical implementation, we use the $k$-nearest-neighbor estimator to approximate intrinsic stochasticity under finite-resolution and finite-sampling settings. Numerical experiments on synthetic and real-world datasets are consistent with expectations: reconstructions closer to deterministic closure tend to produce lower scores, and deterministic predictors that take reconstructions with lower empirical closure scores as input are associated with lower rollout errors. This suggests that intrinsic stochasticity provides a new perspective for understanding failures of reconstruction and serves as a diagnostic for selecting reconstruction maps.
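Latent Guided Sampling for Combinatorial Optimization
Sobihan Surendran, Adeline Fermanian, Sylvain Le Corff
AI summary: Proposes LGS-Net, a latent space model, together with Latent Guided Sampling, an inference method combining Markov chain Monte Carlo and stochastic approximation, achieving state-of-the-art performance on routing tasks.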
Combinatorial Optimization problems are widespread in domains such as logistics, manufacturing, and drug discovery, yet their NP-hard nature makes them computationally challenging. Recent Neural Combinatorial Optimization (NCO) methods leverage deep learning to learn policies for constructing solutions, trained via Supervised or Reinforcement Learning. While promising, these approaches often rely on task-specific augmentations, perform poorly on out-of-distribution instances, and lack robust inference mechanisms. Moreover, existing latent space models either require labeled data or use an instance-independent latent distribution. In this work, we propose LGS-Net, a novel latent space model that conditions on problem instances, and introduce an efficient inference method, Latent Guided Sampling (LGS), based on Markov Chain Monte Carlo and Stochastic Approximation. We show that the iterations of our method form a time-inhomogeneous Markov Chain and provide rigorous theoretical convergence guarantees. Empirical results on benchmark routing tasks show that our method achieves state-of-the-art performance among NCO baselines.
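Itô Maps for Any-Step SDEs
Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, Jakiw Pidstrigach
AI summary: Proposes the Itô map, an any-step stochastic flow map that predicts future states in a single forward pass, enabling exact distillation of stochastic dynamics and supporting inference-time control and posterior sampling.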
Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.
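The notion of a map that takes an intermediate state and a Brownian path and returns a future state can be made concrete with a plain Euler-Maruyama integrator, which is the multi-step computation that a learned single-pass map would aim to replace. This is a minimal sketch under assumed names (`euler_maruyama_map`, an Ornstein-Uhlenbeck drift); it is not the Itô map model itself.

```python
import numpy as np

def euler_maruyama_map(x_s, dW, dt, drift, sigma):
    """Map an intermediate state x_s and fixed Brownian increments dW to the end state."""
    x = float(x_s)
    for dw in dW:
        x = x + drift(x) * dt + sigma * dw
    return x

def ou_drift(x):
    # Ornstein-Uhlenbeck drift, chosen only as a toy example.
    return -x

rng = np.random.default_rng(0)
dt, n_steps = 0.01, 100
path_a = np.sqrt(dt) * rng.standard_normal(n_steps)
path_b = np.sqrt(dt) * rng.standard_normal(n_steps)

end_a = euler_maruyama_map(1.0, path_a, dt, ou_drift, sigma=1.0)
end_a_again = euler_maruyama_map(1.0, path_a, dt, ou_drift, sigma=1.0)
end_b = euler_maruyama_map(1.0, path_b, dt, ou_drift, sigma=1.0)
```

Given the same path the map is deterministic (`end_a` equals `end_a_again`), while different paths from the same intermediate state give different endpoints, which is what makes such a map usable for sampling.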
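Generalized Conformal Predictive Systems under Distribution Shift
Jef Jonkers, Johanna Ziegel
AI summary: To handle distribution shift, encodes the shift through observation-specific permutation weights, extends generalized conformal predictive systems to shift-aware predictive systems, and introduces weight-uncertainty boxes to construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees.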
Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.
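As a simpler relative of the weighted construction described above, the following sketch computes a weighted conformal quantile in the style of standard weighted conformal prediction: calibration scores carry weights, the test point carries its own weight on an infinite score, and the quantile is read off the weighted empirical distribution. The function name and toy numbers are assumptions; the paper's predictive systems and weight-uncertainty boxes are not implemented here.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, w_test, alpha):
    """Smallest score whose normalized cumulative weight reaches 1 - alpha, else inf."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    probs = weights[order] / (weights.sum() + w_test)   # test mass sits at +inf
    cum = np.cumsum(probs)
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    return scores[order][idx] if idx < len(scores) else np.inf

scores = [3.0, 1.0, 4.0, 2.0]
q_uniform = weighted_conformal_quantile(scores, [1, 1, 1, 1], w_test=1.0, alpha=0.5)
q_shifted = weighted_conformal_quantile(scores, [1, 1, 10, 1], w_test=1.0, alpha=0.5)
q_too_strict = weighted_conformal_quantile(scores, [1, 1, 1, 1], w_test=1.0, alpha=0.1)
```

With equal weights this reduces to the usual split conformal quantile; upweighting the calibration point with the largest score pushes the quantile up, and an `alpha` that the calibration mass cannot support yields an infinite quantile.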
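Human-AI Teaming through the Lens of Calibration
Eric Nalisnick, Chi Zhang, Sophia Qian, Yixin Wang
AI summary: Analyzes human-AI teaming models through the lens of statistical calibration, finding that combination methods do not preserve the human's degree of calibration, while delegation methods shift the calibration burden onto the rejector meta-model, a burden that becomes unattainable when the human relies on information the system cannot observe.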
We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and a human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or a model. We show via theoretical and empirical results that existing methods for combination do not preserve the human's degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human's expertise and becomes unattainable when the human relies on information the system cannot observe.
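Near-Exponential Convergence Rates for kNN Classification under Boltzmann Margin
Luyuan Yang, Shayan Shafaei, Chao Lan
AI summary: Proposes the Boltzmann margin condition, which lies between the Tsybakov and Massart margins, and proves for the first time that kNN classifiers can achieve near-exponential convergence rates.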
Convergence-rate analysis for classifiers is often conducted under either Tsybakov margin or Massart margin. The former is a relatively weak condition that typically yields polynomial rates, while the latter is substantially stronger but can guarantee exponential rates. In this paper, we introduce a new condition, called Boltzmann margin, that bridges the gap between these two regimes. It is weaker than Massart margin, generally stronger than Tsybakov margin, and can imply many of their properties under suitable conditions. We apply Boltzmann margin to the analysis of kNN classifiers and establish the first near-exponential convergence rates for kNN classification. We also present extensions of the main results and provide numerical evidence supporting the main theoretical implications.
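Decision-Calibrated Conformal Uncertainty for Pacing in Streaming Advertising
Prashant Shekhar, Caroline Howard
AI summary: Proposes a decision-calibrated conformal framework that calibrates uncertainty by the largest impact of forecast error on the policies that could actually be deployed, proves that this score is the smallest valid uncertainty measure protecting all deployable pacing policies, and substantially reduces uncertainty radii on public datasets.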
We develop a decision-calibrated conformal framework for pacing decisions in streaming advertising. Pacing depends on uncertain future inventory, demand pressure, incremental response, and member-experience load. Instead of calibrating a generic forecast residual, the framework measures forecast error by its largest impact on the policies that could actually be deployed. The main theorem shows that the proposed score is the smallest valid uncertainty measure that uniformly protects all deployable pacing policies. Geometrically, it is the support function of the signed policy sensitivity set. Split conformal calibration gives finite-sample coverage for this score. A high-dimensional separation theorem shows that traditional residual calibration can be arbitrarily more conservative by paying for nuisance inventory dimensions, and a robust pacing result combines inventory, response, and experience uncertainty. On public-data-calibrated pacing replays built from Criteo Uplift and KuaiRand datasets, traditional conformal pacing remains unresolved with high residual radii of 7236.7 on Criteo and 4629.4 on KuaiRand. With the proposed decision calibration approach, the uncertainty radii are reduced to 18.4 and 278.6 respectively, with separate margins for value, delivery, budget, and member load. On Criteo, the proposed method certifies a less aggressive pacing policy than the point-forecast baseline, and reduces held-out any-violation rate from 16.7% to 3.3%, with zero budget and member-load violations. On KuaiRand, the choice remains unresolved. In a nutshell, the paper establishes that forecasts, response estimates, and member-experience models should be judged by whether they shrink the uncertainty that the pacing decision uses, as this leads to confident decisions that are not overly conservative.
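The contrast between a decision-calibrated score and a generic residual can be sketched in a few lines. Below, the score is the support function of a small finite set of signed sensitivity vectors, and the radius is the standard split conformal quantile. The sensitivity vectors, the error distribution, and the dimension counts are invented for illustration and are not taken from the paper's Criteo or KuaiRand replays.

```python
import numpy as np

def support_score(errors, sensitivities):
    """Support function of a finite signed sensitivity set, per error vector."""
    signed = np.vstack([sensitivities, -sensitivities])   # include both signs
    return (errors @ signed.T).max(axis=1)

def split_conformal_radius(scores, alpha):
    """Standard split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

rng = np.random.default_rng(0)
# Hypothetical forecast errors: 2 decision-relevant dims, 50 noisy nuisance dims.
errors = np.hstack([rng.normal(0, 1, (500, 2)), rng.normal(0, 20, (500, 50))])
sens = np.zeros((3, 52))
sens[:, :2] = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]        # policies ignore nuisance dims

r_decision = split_conformal_radius(support_score(errors, sens), alpha=0.1)
r_residual = split_conformal_radius(np.linalg.norm(errors, axis=1), alpha=0.1)
```

In this toy setup the residual-norm radius is dominated by the nuisance dimensions, while the support-function radius only reflects directions that some policy is sensitive to, which is the separation effect the abstract describes.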
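Robust Active Learning for Few-Shot Example Selection in Text-to-SQL
Arash Pourhabib
AI summary: For few-shot example selection in text-to-SQL, proposes a robust active learning method that maximizes a heteroscedastic mutual information objective with a stratified greedy algorithm, achieves a constant-factor approximation guarantee on the embedding manifold, and significantly reduces annotation cost.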
Few-shot example retrieval is the dominant paradigm for grounding large language models (LLMs) in domain-specific text-to-SQL systems. However, the quality of the annotated example bank directly governs system accuracy, and expert annotation is prohibitively expensive. We formalize the active selection of these examples as a constrained experimental design problem over the intrinsic, low-dimensional manifold of semantic query embeddings. Unlike standard active learning frameworks, our setting introduces three critical challenges: varying, query-dependent annotation reliability (heteroscedasticity), strict requirements for spatial diversity across semantic topics (partition matroid constraints), and the inherent reality that the true covariance structure of the embedding space is unknown (misspecification). To address these, we propose a stratified greedy algorithm that maximizes a heteroscedastic mutual information objective. We prove that this objective remains submodular and approximately monotonic on the intrinsic manifold, yielding a theoretical constant-factor approximation guarantee. We establish a spectral bound demonstrating that this approximation guarantee degrades gracefully, rather than catastrophically, when the assumed surrogate kernel diverges from the true underlying data-generating process. Empirical results demonstrate that the proposed strategy significantly reduces labeling effort while maintaining high text-to-SQL retrieval accuracy.
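Limitations of Learning Tanh Neural Networks under Finite Precision
Philipp Grohs, Matěj Trödler
AI summary: Under finite-precision computation and $L^p$ accuracy guarantees, constructs sharply localized bump functions to prove that adaptive randomized algorithms cannot converge faster than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm unless the sampling budget grows exponentially with the network parameters and architecture.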
We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^p$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.
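To give a feel for what a localized bump built from $\tanh$ units looks like, the sketch below uses the difference of two shifted, steep $\tanh$ activations. This single-layer construction is an assumption chosen for simplicity; the paper's bumps are built from iterated activations and are not reproduced here.

```python
import numpy as np

def tanh_bump(x, center=0.0, half_width=0.1, sharpness=50.0):
    """Difference of two shifted tanh units: near 1 around center, near 0 far away."""
    left = np.tanh(sharpness * (x - center + half_width))
    right = np.tanh(sharpness * (x - center - half_width))
    return 0.5 * (left - right)

# Fraction of uniformly drawn sample points on [-5, 5] that land on the bump.
rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=10_000)
hit_fraction = np.mean(tanh_bump(samples) > 0.5)
```

Only a small fraction of random point evaluations see the bump at all, which hints at why classes containing such functions are hard to learn from samples.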
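Flexible Kernels for Protein Property Prediction
Martin Jankowiak, Yerdos Ordabayev, Rudraksh Tuwani, Henry N. Ward, Hunter Nisonoff, James M. McFarland, Gevorg Grigoryan
AI summary: Proposes sequence kernels that exploit evolutionary substitution matrices and local linearity, combines them with Gaussian processes for data-efficient protein property prediction, and incorporates structural information for multi-task learning.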
Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
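Conservation Laws Induced by Data Symmetries in Neural Networks
Jakob Galley, Vahid Shahverdi, Axel Flinth
Affiliation: Umeå University
AI summary: Studies whether symmetries of the training data produce conserved quantities during gradient-flow training, proving that for analytic non-polynomial loss functions data symmetries generically yield no additional conserved quantities; for mean squared error loss, data augmentation can yield additional conserved quantities, a phenomenon described using the framework of tensorizable networks.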
We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework based on tensorizable networks to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.
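As background for what a conserved quantity under gradient flow means, the sketch below checks the classical balance law for a two-layer linear network $x \mapsto W_2 W_1 x$ trained with MSE: the matrix $W_1 W_1^\top - W_2^\top W_2$ is constant along gradient flow. This is the well-known architecture-induced law, not one of the data-induced quantities studied in the paper, and small-step gradient descent is used as an approximation of the flow. All sizes and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))                  # inputs, one column per sample
Y = rng.standard_normal((2, 4)) @ X                # targets from a linear teacher
W1 = 0.5 * rng.standard_normal((3, 4))
W2 = 0.5 * rng.standard_normal((2, 3))

def loss(W1, W2):
    return 0.5 * np.mean(np.sum((W2 @ W1 @ X - Y) ** 2, axis=0))

def balance(W1, W2):
    # Conserved under exact gradient flow for two-layer linear networks.
    return W1 @ W1.T - W2.T @ W2

Q0, loss0 = balance(W1, W2), loss(W1, W2)
lr = 1e-3
for _ in range(5000):
    G = (W2 @ W1 @ X - Y) @ X.T / X.shape[1]       # gradient w.r.t. the product W2 W1
    W1, W2 = W1 - lr * W2.T @ G, W2 - lr * G @ W1.T

drift = np.linalg.norm(balance(W1, W2) - Q0)
loss1 = loss(W1, W2)
```

The loss drops substantially while the balance matrix stays nearly fixed; the residual drift comes from the finite step size and shrinks as `lr` decreases.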
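SPACR: An Uncertainty-Aware Conformal Regressor with Single-Pass Adaptive Training
Soundouss Messoudi, Sylvain Rousseau, Sébastien Destercke
Affiliation: Heudiasyc - UMR CNRS 7253, Université de Technologie de Compiègne
AI summary: Proposes SPACR, which trains an uncertainty-aware regressor directly through a differentiable loss, jointly optimizing efficiency and validity without batch-splitting or predefined confidence levels; a single model supports prediction intervals at multiple confidence levels at inference time, and experiments show narrower intervals, better coverage-efficiency trade-offs, and lower computational cost.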
Conformal Prediction (CP) provides robust uncertainty guarantees for predictive models, but is typically applied post hoc, which misaligns model training with the conformal goal of producing efficient (i.e., narrow) intervals. We propose SPACR (Single-Pass Adaptive Conformal Regressor), a novel method for directly training uncertainty-aware regressors within a differentiable loss. SPACR jointly optimizes efficiency and validity without batch-splitting or predefined confidence levels during training. As a result, a single SPACR model yields valid prediction intervals at multiple confidence levels during inference, avoiding the costly retraining required by methods like DOICR. Experiments on diverse datasets show that SPACR consistently gives tighter intervals and better coverage-efficiency trade-offs compared to standard CP and DOICR, while significantly reducing computational costs.
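TENP: Trapezoidal Expert Neuron Pruning for Mixture-of-Experts
Jiangyang He, Shaolin Zhu, Deyi Xiong
AI summary: Proposes the TENP framework, which identifies important experts and applies neuron pruning to the remaining experts, retaining parameters in a trapezoidal pattern; at 40% routing expert sparsity and an average of 63.76% activated parameters, the DeepSeek model loses only 1 point of accuracy and improves by 10% on code generation tasks.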
Mixture-of-Experts large language models (LLMs) scale efficiently through sparse activation, yet their deployment is fundamentally constrained by the large static parameter footprint of experts. Existing compression approaches either remove entire experts, disrupting routing topology and harming performance, or rely on unstructured weight pruning with limited practical efficiency. To address these limitations, we propose TENP, a structured Trapezoidal Expert Neuron Pruning framework. Using a few samples, we identify and retain important experts, while applying expert neuron pruning (ENP) to less important experts, reserving model parameters in a trapezoidal pattern from shallow to deep layers. When evaluating expert importance, we jointly consider both the magnitude of the expert output and its ability to change the direction of the input vector. For ENP, we measure each neuron's projected contribution to the expert output to identify and retain important neurons. We conduct extensive experiments on the Qwen and DeepSeek models. Under a routing expert sparsity of 40% and an average of 63.76% activated expert parameters, the DeepSeek model suffers only a 1-point drop in accuracy compared to the full-parameter model. Moreover, it outperforms the full-parameter model by 10% on code generation tasks.
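Integrating Local and Global Entropy for Uncertainty Quantification in Large Language Models
Johanne Medina, Tianyi Zhou, Keivin Isufaj, Aristides Gionis, Sanjay Chawla
AI summary: Proposes GLU, which quantifies LLM uncertainty by fusing hidden-state geometric entropy (global) with token-level entropy (local), effectively capturing the confident-but-wrong failure mode without additional training.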
Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods rely predominantly on token-level signals, leaving the geometric structure of intermediate hidden states underused. In this paper, we take the geometric complexity of hidden-state matrices as a measure of the global uncertainty of LLMs, while treating token-level uncertainty estimation as a local metric. We show that hidden-state geometric entropy (global uncertainty) and token-level entropy (local uncertainty) are statistically near-orthogonal, capturing distinct failure regimes for reliability prediction. In particular, global geometry recovers the confident-but-wrong failure mode that local signals systematically miss. Building on this, we propose Global-Local Uncertainty (GLU), an unsupervised, single-pass score that fuses the two signals via a multiplicative gate. Across three model families and six benchmarks, GLU matches or outperforms all unsupervised baselines while requiring only a single forward pass and remaining length-normalized and architecture-agnostic.
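Training Large Language Models for Inductive Reasoning with Probabilistic Programs
Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths
AI summary: Proposes Program-based Posterior Training (PPT), which uses an LLM to generate scenarios as probabilistic programs, runs inference to produce distributional targets, and fine-tunes the model to improve inductive reasoning accuracy, alignment with human judgments, and calibration.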
Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.
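Express Language Modeling
Albert Gong, Annabelle Michael Carrell, Raaz Dwivedi, Lester Mackey
AI summary: Proposes Express, a tool that converts non-causal attention approximations into causal ones, combines it with Thinformer to improve on the best known causal attention guarantees, and alleviates four resource bottlenecks in language modeling.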
We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.
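Range Penalty: Theoretical Insights and Applications to Federated Learning
Yiyuan She, Zhaojun Hu, Yifan Sun
AI summary: Proposes range regularization, which achieves cross-client regularity through polar clustering, and develops new proof techniques for nonasymptotic statistical accuracy and pattern recovery, along with a fast optimization algorithm that exploits local strong convexity.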
This paper introduces range regularization for federated learning with linear systematic components to enhance statistical accuracy and induce cross-client regularity conducive to quantization, coding, and resource efficiency. Our approach identifies features with shared weights across different clients and adaptively clusters the weights of personalized features at extreme values, a process we refer to as polar clustering. Theoretical analysis of the associated estimators poses significant challenges due to the seminorm nature and non-decomposability of the regularizer. We develop new proof techniques for the nonasymptotic analysis of statistical accuracy and faithful pattern recovery. Moreover, a fast optimization algorithm that leverages varying degrees of local strong convexity is proposed to reduce iteration complexity. Experiments support the efficacy and efficiency of the proposed approach.
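One plausible reading of a range penalty, stated here as an assumption since the abstract does not give the formula, is the sum over features of the spread (maximum minus minimum) of that feature's weight across clients. The sketch below implements this reading and shows the two properties the abstract alludes to: features with shared weights incur no penalty, and the penalty is a seminorm rather than a norm.

```python
import numpy as np

def range_penalty(W):
    """W has shape (n_clients, n_features); sum over features of max - min across clients."""
    W = np.asarray(W, dtype=float)
    return float(np.sum(W.max(axis=0) - W.min(axis=0)))

# Feature 0 is shared by all clients; feature 1 is personalized.
W = np.array([[1.0, 2.0],
              [1.0, 5.0],
              [1.0, 3.0]])
penalty = range_penalty(W)
penalty_shifted = range_penalty(W + np.array([10.0, -4.0]))  # per-feature shift
penalty_scaled = range_penalty(2.0 * W)
```

Shifting every client's weight for a feature by the same constant leaves the penalty unchanged, so it vanishes on nonzero inputs and only penalizes disagreement across clients.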
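$k$-Nearest Neighbors in Gromov--Wasserstein Space
Kaitlyn Hohmeier, Nicolas Fraiman, Caroline Moosmueller
AI summary: Implements $k$-nearest-neighbor classification under the Gromov--Wasserstein distance, proves universal consistency of the classifier on metric measure spaces and on graphs, and validates it experimentally.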
Gromov--Wasserstein (GW) 距离为比较度量测度空间提供了一个框架,无论其底层结构或几何形状如何。对于基于网络的数据,它能够直接比较具有不同节点数量的图,无需嵌入或其他抽象。此外,通过GW的变体——融合Gromov--Wasserstein (fGW),还可以在图形结构之外结合节点特征。在这项工作中,我们使用GW和fGW距离实现了$k$-最近邻 ($k$-NN) 分类。我们证明了在具有有限支撑和均匀概率测度的度量测度空间等价类空间上,GW-$k$-NN分类器的普适一致性。通过将图视为具有成对距离度量和节点上均匀概率测度的有限支撑度量测度空间,我们获得了图空间上GW-$k$-NN的普适一致性。类似地,对于fGW-$k$-NN,我们证明了在由具有有限支撑和均匀概率测度的度量测度空间以及到欧几里得空间的特征映射组成的结构化对象的弱同构类空间上的普适一致性,从而建立了节点属性图空间上的普适一致性。我们的数值实验表明,GW-$k$-NN和fGW-$k$-NN在多个图数据集上始终表现良好,这表明诸如$k$-NN之类的度量分类器在GW框架中效果良好。
The Gromov--Wasserstein (GW) distance provides a framework for comparing metric measure spaces, regardless of their underlying structure or geometry. For network-based data, it enables direct comparisons of graphs with different numbers of nodes, without requiring an embedding or other abstraction. Furthermore, through a variant of GW known as fused Gromov--Wasserstein (fGW), it is also possible to incorporate node features in addition to graph structure. In this work, we implement $k$-nearest neighbors ($k$-NN) classification using the GW and fGW distances. We prove the universal consistency of the GW-$k$-NN classifier on the space of equivalence classes of metric measure spaces with finite support and uniform probability measure. By viewing graphs as finitely supported metric measure spaces equipped with the pairwise distance metric and a uniform probability measure on the nodes, we obtain universal consistency of GW-$k$-NN for the space of graphs. Likewise for fGW-$k$-NN, we prove universal consistency on the space of weak isomorphism classes of structured objects consisting of metric measure spaces with finite support and uniform probability measure and feature maps into Euclidean space, thus establishing universal consistency on the space of node-attributed graphs. Our numerical experiments show that GW-$k$-NN and fGW-$k$-NN consistently perform well across multiple graph datasets, suggesting that metric classifiers such as $k$-NN work well in the GW framework.
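Once pairwise distances are available, the classification step itself is ordinary $k$-NN. The sketch below assumes the GW (or fGW) distances from a query graph to the training graphs have already been computed by some solver; the function `knn_predict` and the distance values are hypothetical.

```python
from collections import Counter

def knn_predict(dist_to_train, train_labels, k):
    """Majority vote among the k training objects closest to the query."""
    order = sorted(range(len(train_labels)), key=lambda i: dist_to_train[i])
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up GW distances from one query graph to five training graphs.
dists = [0.12, 0.80, 0.15, 0.95, 0.30]
labels = ["A", "B", "A", "B", "B"]
pred = knn_predict(dists, labels, k=3)
```

Because the classifier only needs distances, graphs with different numbers of nodes can be compared without any embedding step.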
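Convergence Rates of Neural Network Estimation with Current-Status Data
Yuan Wu, Tianhui Zhou
AI summary: For current-status data, proposes a nonparametric neural-network sieve maximum likelihood estimator and establishes an explicit convergence rate under Hölder smoothness by combining ReLU-network approximation theory with empirical-process arguments.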
当前状态数据出现在事件时间仅通过一个指示变量(是否在检查时间之前发生)被观测到时。本文研究了事件时间条件累积分布函数的非参数神经网络筛最大似然估计器。在Hölder光滑假设下,我们通过结合整流线性单元神经网络的逼近理论与经验过程论证,建立了显式收敛速度。这一结果为当前状态观测下的神经网络估计及后续推断提供了理论支持。
Current-status data arise when an event time is observed only through an indicator of whether it occurred before an examination time. This paper studies a nonparametric neural-network sieve maximum likelihood estimator of the conditional cumulative distribution function of the event time. Under Hölder smoothness assumptions, we establish an explicit convergence rate by combining approximation theory for rectified linear unit neural networks with empirical-process arguments. This result provides theoretical support for neural-network estimation and subsequent inference under current-status observation.
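The likelihood behind such an estimator can be sketched as follows. For observations $(C_i, \Delta_i, X_i)$ with $\Delta_i = 1\{T_i \le C_i\}$, the standard current-status log-likelihood is $\sum_i \Delta_i \log F(C_i \mid X_i) + (1-\Delta_i)\log(1 - F(C_i \mid X_i))$. The names `current_status_loglik` and `toy_cdf` are hypothetical, and the toy exponential model stands in for the ReLU network, which this sketch does not implement.

```python
import math

def current_status_loglik(cdf, exam_times, indicators, covariates, eps=1e-12):
    """Average current-status log-likelihood under a candidate conditional CDF.

    cdf(c, x) should return P(T <= c | X = x); any callable works here.
    """
    total = 0.0
    for c, delta, x in zip(exam_times, indicators, covariates):
        p = min(max(cdf(c, x), eps), 1.0 - eps)  # clip to keep the logs finite
        total += delta * math.log(p) + (1 - delta) * math.log(1.0 - p)
    return total / len(exam_times)

# Hypothetical toy model: exponential event time with rate exp(x).
def toy_cdf(c, x):
    return 1.0 - math.exp(-math.exp(x) * c)

exam_times = [0.5, 1.0, 2.0]
indicators = [0, 1, 1]  # 1 if the event occurred before the examination time
covariates = [0.0, 0.0, 0.5]
avg_ll = current_status_loglik(toy_cdf, exam_times, indicators, covariates)
```

A sieve estimator would maximize this quantity over a class of networks whose size grows with the sample size.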
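A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training
Cheng Huan, Hongfwei Yuan
AI summary: Develops a mean-field theory for a single-layer causal multi-head self-attention model trained by cross-entropy minimization: proves a finite-head approximation bound, characterizes global minimizers, establishes a propagation-of-chaos estimate, and studies the long-time behavior of the limiting PDE.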
本文针对通过交叉熵最小化训练的简化单层因果多头自注意力模型,发展了一套平均场理论。每个注意力头被视为参数空间中的一个粒子,头的经验律被用作大头状态变量。在无限头极限下,平均注意力logits定义了概率测度上的风险泛函,其一阶变分生成非线性Wasserstein梯度流方程。与通常关注平方损失回归的浅层网络经典平均场分析不同,本模型包含交叉熵目标中的softmax残差以及掩码自注意力的查询-键-值结构。我们证明了最优风险的静态有限头近似界,通过变分支撑条件刻画了全局极小元,并建立了将有限头随机梯度下降与极限偏微分方程进行比较的定量有限时间传播混沌估计。然后我们研究了偏微分方程的长时间行为:能量耗散、在紧致性下收敛到平稳集、在拓扑或Kurdyka--Łojasiewicz假设下收敛到单个平稳测度、以及在梯度主导条件下的显式收敛速率。最后,我们在Wasserstein强单调性条件下证明了局部指数稳定性,并给出了Dirac平稳测度的可验证稳定性和不稳定性判据。这些结果为注意力头训练提供了一个严格的基线平均场框架,并阐明了从平稳性到收敛性和稳定性所需的额外紧致性、景观和曲率假设。
This paper develops a mean-field theory for a simplified single-layer causal multi-head self-attention model trained by cross-entropy minimization. Each attention head is treated as a particle in parameter space, and the empirical law of the heads is used as the large-head state variable. In the infinite-head limit, the averaged attention logits define a risk functional on probability measures, whose first variation generates a nonlinear Wasserstein gradient-flow equation. Unlike classical mean-field analyses of shallow networks that often focus on square-loss regression, the present model contains the softmax residual from the cross-entropy objective and the query-key-value structure of masked self-attention. We prove a static finite-head approximation bound for the optimal risk, characterize global minimizers through a variational support condition, and establish a quantitative finite-time propagation-of-chaos estimate comparing finite-head stochastic gradient descent with the limiting PDE. We then study the long-time behavior of the PDE: energy dissipation, convergence to the stationary set under compactness, convergence to a single stationary measure under topological or Kurdyka--Łojasiewicz assumptions, and explicit convergence rates under gradient-domination conditions. Finally, we prove local exponential stability under a Wasserstein strong-monotonicity condition and give verifiable stability and instability criteria for Dirac stationary measures. The results provide a rigorous baseline mean-field framework for attention-head training and clarify the additional compactness, landscape, and curvature assumptions needed to pass from stationarity to convergence and stability.
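Rank Collapse, Fixed Points, and Renormalization Group Structure in MLP Residual Networks
Parviz Haggi-Mani, Irina Rish
AI summary: Using MLP residual networks trained on masked prediction over synthetic Markov chains, gives the first quantitative evidence of selective rank collapse along network depth, consistent with the integrating out of irrelevant degrees of freedom in the renormalization group, and finds that inter-layer kernel drift concentrates at a few transitions.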
深度神经网络前向传播与重正化群流之间的类比在文献中反复被提及,但现有处理仍是定性的:深度被描述为粗粒化尺度,注意力被比作配分函数,表示被认为流向不动点。尚无工作定义可测量的RG序参量,在输入分布受控变化下测试它,或做出经实验验证的定量预测。我们研究了类比可处理的最简单架构:一个纯MLP残差堆栈,在具有已知谱性质的合成马尔可夫链序列上训练掩码标记预测。我们报告三个发现。(i) 训练后残差流的有效秩随深度单调递减,与无关自由度的逐步整合一致。(ii) 这种秩坍缩是选择性的:它发生在相关长度约1的短链上,但在相关长度约7的长链上不存在(在位置级别测量以控制均值池化伪影)。网络精确保留了预测任务相关的自由度,即RG相关性判据的内容。(iii) 层间核漂移集中在一两个特定转换处,网络其余部分接近不动点,与离散不动点平台一致。这些发现共同构成了首个定量的位置级证据,表明MLP残差网络实现了由输入分布谱结构控制的选择性粗粒化过程。
The analogy between deep neural network forward passes and renormalization group (RG) flows has been repeatedly noted in the literature, but existing treatments remain qualitative: depth is described as a coarse-graining scale, attention is likened to a partition function, and representations are said to flow toward fixed points. No existing work has defined a measurable RG order parameter, tested it under controlled variation of the input distribution, or made quantitative predictions that are empirically verified. We study the simplest architecture for which the analogy is tractable: a pure MLP residual stack trained on masked token prediction over synthetic Markov chain sequences with known spectral properties. We report three findings. (i) The effective rank of the residual stream decreases monotonically with depth after training, consistent with progressive integration of irrelevant degrees of freedom. (ii) This rank collapse is selective: it occurs for chains with short correlation length approximately 1 but is absent for chains with long correlation length approximately 7, measured at the position level to control for mean-pooling artifacts. The network preserves exactly the degrees of freedom relevant to the prediction task, the content of the RG relevance criterion. (iii) Inter-layer kernel drift is concentrated at one or two specific transitions, with the remainder of the network near a fixed point, consistent with a discrete fixed-point plateau. Together these findings constitute the first quantitative, position-level evidence that MLP residual networks implement a selective coarse-graining procedure governed by the spectral structure of the input distribution.
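One common way to quantify effective rank is the exponential of the Shannon entropy of the normalized singular values. The abstract does not say which definition the authors use, so this choice is an assumption, as is the function name `effective_rank`. The sketch takes the singular values of a layer's activation matrix as given.

```python
import math

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# A flat spectrum uses every direction; a collapsed one is dominated by a single direction.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
collapsed = effective_rank([1.0, 1e-6, 1e-6, 1e-6])
```

Tracking this quantity layer by layer gives a depth profile in which a monotone decrease would indicate rank collapse.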
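Generalization in Nonlinear Least Squares via Learned Feature Geometry
Ayub Kharel, Ilja Kuzborskij, Patrick Rebeschini, Yasin Abbasi-Yadkori
AI summary: Analyzes the generalization error of ridge-regularized nonlinear least squares via algorithmic stability, defines a data-dependent effective dimension through the empirical Jacobian Gram matrix and a residual-curvature term, and shows that it scales with intrinsic dimension rather than parameter count.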
我们通过平均算法稳定性研究了岭正则化非线性最小二乘模型的泛化性,推导了局部极小值点的误差界,该误差界依赖于数据依赖的有效维度,该维度通过经验雅可比Gram矩阵和残差-曲率项反映了训练参数处梯度模型的几何结构。在线性情况下,曲率项消失,这恢复了雅可比核协方差的经典有效维度,但评估的是训练后的模型而非初始化时的模型(如神经正切核分析中常见)。我们进一步通过梯度特征的覆盖复杂度来界定该有效维度,从而得到依赖于学习几何而非参数数量的保证。特别地,对于流形支持的数据和分段Lipschitz雅可比矩阵,界限随内在维度缩放;而对于单隐层ReLU网络,该机制可通过激活稳定区域的数量显式表达。在合成流形、聚类分布和基准数据集上的实验展示了训练后雅可比矩阵的压缩、残差-曲率线性化的紧致性,以及稳定性界限与观测泛化差距的一致性。我们界限的一个关键特征是推导的简洁性,它基于强对数凹噪声下的Brascamp-Lieb不等式从第一性原理得出。
We study the generalization of ridge-regularized nonlinear least-squares models via on-average algorithmic stability, deriving error bounds for local minimizers in terms of a data-dependent effective dimension that reflects the geometry of the gradient model at the trained parameters, through the empirical Jacobian Gram matrix and a residual-curvature term. In the linear case, where the curvature term vanishes, this recovers the classical effective dimension of the Jacobian kernel covariance, but evaluated at the trained model rather than at initialization as is typical in neural tangent kernel analyses. We further bound this effective dimension via covering complexity of the gradient features, leading to guarantees that depend on learned geometry rather than parameter count. In particular, for manifold-supported data and piecewise Lipschitz Jacobians, the bounds scale with intrinsic dimension, while for one-hidden-layer ReLU networks, the mechanism can be made explicit through counts of activation-stable regions. Experiments on synthetic manifolds, clustered distributions, and benchmark datasets illustrate trained-Jacobian compression, the tightness of the residual-curvature linearization, and agreement between the stability bound and observed generalization gaps. A key feature of our bounds is the simplicity of their derivation, which follows from first principles using the Brascamp-Lieb inequality under strongly log-concave noise.
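The Edge of Stability Selectively Shapes Learning Across the Data Distribution
Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano
AI summary: Finds that the edge of stability (EoS) in optimization is selective: branching interventions causally show that EoS redistributes learning across subsets of the training data, and two necessary conditions for a group to benefit are identified, namely gradient alignment with the top Hessian eigenvector and sustained non-vanishing gradient magnitude.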
现有对稳定性边缘(EoS)的分析将其视为优化的全局属性。我们表明它也具有选择性:稳定性约束在训练分布的各个子集之间重新分配学习,放大某些组上的进展,同时抑制其他组上的进展。通过从相同训练状态进入或退出EoS regime的分支干预,我们因果地证明了这种权衡,并识别了组受益的两个必要条件。首先,其聚合梯度必须与顶部Hessian特征向量对齐。我们通过一个受控扰动隔离了这一机制,该扰动保持距离但随机化方向,破坏了对齐并消除了优势。其次,该组必须随时间保持非零梯度幅度。在交叉熵损失下,梯度饱和使置信度高的组解耦,将优势转移到输出异常值,后者的梯度持续存在。总之,这些结果表明EoS不仅作为稳定性边界,而且作为控制数据分布上学习分配的机制。
Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.
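Sample-Efficient Inductive Matrix Completion with Noisy and Inexact Side Information
Yuepeng Yang, Cong Ma
AI summary: Studies sample-efficient inductive matrix completion under noise and inexact side information via nonconvex projected gradient descent, establishing a regularity condition that holds at the effective problem size and yields linear convergence with estimation error depending only on that effective size.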
低秩矩阵补全是一个广泛研究的问题,具有许多变体。归纳矩阵补全(IMC)结合了行和列的侧信息以显著缩小搜索空间。先前的工作分为两个领域:利用这种结构实现减少样本复杂度的方法,但仅适用于无噪声环境;以及处理噪声但需要样本复杂度与环境矩阵维度相匹配的方法,从而放弃了侧信息应提供的样本效率。在本文中,我们通过研究具有噪声的IMC并使用非凸投影梯度下降算法进行谱初始化来填补这一差距。我们的主要技术贡献是建立一个适用于由有效问题规模决定的减少样本复杂度的IMC损失函数的正则性条件,其规模与侧信息维度而非环境维度成比例。这直接导致了线性收敛和估计误差仅依赖于有效问题规模而非环境矩阵维度。我们进一步将分析扩展到不精确侧信息设置,证明减少的样本复杂度得以保持,并且估计误差在不精确性方面是最佳的。广泛的模拟和在MovieLens数据集上的实际实验验证了我们的理论发现。
Inductive matrix completion (IMC) is a variant of low-rank matrix completion that incorporates row and column side-information. In principle, it can reduce the effective dimension of the recovery problem from the ambient matrix size to the dimension of the side-information features. Existing theory, however, does not fully realize this advantage in the noisy setting: sample-efficient guarantees only apply to noiseless recovery, while noisy guarantees require sample sizes comparable to ordinary matrix completion. This paper closes this gap for noisy IMC. We analyze a nonconvex projected gradient descent algorithm with spectral initialization and prove that, under exact side-information, it achieves linear convergence and stable recovery at a sample complexity governed by the effective side-information dimension rather than the ambient matrix dimension. The key technical ingredient is a local regularity condition for the IMC loss that holds at this reduced sample size, despite the mismatch between the observation pattern and the side-information subspaces. We further extend the analysis to inexact side-information, showing that the same reduced sample complexity is preserved and that the estimation error degrades optimally with the level of subspace misspecification. Motivated by this trade-off, we also propose a penalized interpolation between IMC and ordinary matrix completion that balances sample efficiency against robustness to imperfect side-information. Simulations and experiments on the MovieLens dataset support the theoretical findings and illustrate the practical benefits of exploiting side-information in low-sample regimes.
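Exact Functional ANOVA Decomposition for Models with Categorical Inputs
Baptiste Ferrere, Nicolas Bousquet, Fabrice Gamboa, Jean-Michel Loubes, Joseph Muré
AI summary: For models with categorical inputs, proposes an assumption-free closed-form functional ANOVA decomposition that handles arbitrary dependence structures efficiently and naturally generalizes SHAP values.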
函数ANOVA通过将模型预测分解为主效应和高阶交互,为可解释性提供了原则性框架。对于独立特征,该分解定义明确,与SHAP值紧密相关,并作为加性可解释性的基石。然而,对于一般依赖分布,缺乏显式闭式表达式迫使实践者依赖昂贵的基于采样的近似。我们完全解决了类别输入的这一限制。通过将函数分析与离散傅里叶分析的扩展相结合,我们在没有任何假设的情况下推导出闭式分解。我们的公式计算效率非常高。它无缝地恢复了经典独立情况,并扩展到任意依赖结构,包括具有非矩形支撑的分布。此外,利用SHAP与ANOVA在独立性下的内在联系,我们的框架为一般类别设置提供了SHAP值的自然推广。
Functional ANOVA offers a principled framework for interpretability by decomposing a model's prediction into main effects and higher-order interactions. For independent features, this decomposition is well-defined, strongly linked with SHAP values, and serves as a cornerstone of additive explainability. However, the lack of an explicit closed-form expression for general dependent distributions has forced practitioners to rely on costly sampling-based approximations. We completely resolve this limitation for categorical inputs. By bridging functional analysis with the extension of discrete Fourier analysis, we derive a closed-form decomposition without any assumption. Our formulation is computationally very efficient. It seamlessly recovers the classical independent case and extends to arbitrary dependence structures, including distributions with non-rectangular support. Furthermore, leveraging the intrinsic link between SHAP and ANOVA under independence, our framework yields a natural generalization of SHAP values for the general categorical setting.
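For orientation, the classical independent case that the decomposition is said to recover can be written out for two categorical inputs. This is the textbook construction under a product distribution, not the paper's closed form for dependent inputs; the function `anova_two_factors` and the toy table are illustrative assumptions.

```python
def anova_two_factors(f, p1, p2):
    """Classical functional ANOVA of f[a][b] under independent marginals p1, p2."""
    n1, n2 = len(p1), len(p2)
    # Constant term: the overall mean of f.
    f0 = sum(p1[a] * p2[b] * f[a][b] for a in range(n1) for b in range(n2))
    # Main effects: conditional means minus the overall mean.
    f1 = [sum(p2[b] * f[a][b] for b in range(n2)) - f0 for a in range(n1)]
    f2 = [sum(p1[a] * f[a][b] for a in range(n1)) - f0 for b in range(n2)]
    # Interaction: whatever the additive part does not explain.
    f12 = [[f[a][b] - f0 - f1[a] - f2[b] for b in range(n2)] for a in range(n1)]
    return f0, f1, f2, f12

# Toy model values on a 2 x 3 grid of categories (made-up numbers).
f = [[1.0, 2.0, 4.0],
     [3.0, 5.0, 9.0]]
p1 = [0.4, 0.6]
p2 = [0.5, 0.3, 0.2]
f0, f1, f2, f12 = anova_two_factors(f, p1, p2)
```

Each non-constant term has mean zero in every one of its arguments, which is the property that breaks down for naive conditional means once the inputs are dependent.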
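Blind Denoising Diffusion Models and the Blessings of Dimensionality
Zahra Kadkhodaie, Aram-Alexandre Pooladian, Sinho Chewi, Eero Simoncelli
AI summary: Proposes blind denoising diffusion models (BDDMs), which simplify design by not passing the noise amplitude to the neural network, proves their correctness under the assumption that the data's intrinsic dimension is low relative to the ambient dimension, and shows experimentally the benefits of an adaptive scheme.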
去噪扩散模型(DDM)是跨多个领域从数据中学习密度的最先进方法,然而训练和采样流程的许多方面仍知之甚少。特别是,噪声调节要求从业者将人为设计的无原则噪声嵌入纳入神经网络架构,并使用临时噪声调度进行采样。为了解决这些缺点,我们提供了\emph{盲去噪扩散模型}(BDDM)的完整理论:这是DDM的一种变体,其中噪声幅度在训练或采样期间不传入神经网络,从而消除了上述设计选择的需要。我们在数据分布相对于环境维度具有低内在维度的假设下证明了BDDM作为采样算法的正确性。这一假设源于从单个噪声样本估计噪声水平的贝叶斯问题的引入,该问题可能具有独立的意义。我们通过实验将BDDM的性能与标准DDM进行比较,展示了我们分析严格证明的\emph{自适应}方案的优势。
Denoising diffusion models (DDMs) are state-of-the-art methods for learning densities from data across numerous domains, yet many aspects of the training and sampling pipeline remain poorly understood. In particular, noise conditioning requires practitioners to incorporate contrived unprincipled noise embeddings into neural network architectures and to use ad hoc noise schedules for sampling. To address these drawbacks, we provide a complete theory for \emph{blind denoising diffusion models} (BDDMs): a variant of DDMs where the noise amplitude is not passed into the neural network during training or sampling, obviating the need for the aforementioned design choices. We justify the correctness of BDDMs as a sampling algorithm under an assumption of low intrinsic dimensionality of the underlying data distribution relative to the ambient dimension. This assumption arises through the introduction of the Bayesian problem of estimating noise levels from a single noisy sample, which might be of independent interest. We empirically compare the performance of BDDMs to standard DDMs, showcasing the benefits of an \emph{adaptive} scheme which is rigorously justified by our analysis.
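Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access
Daniel Ebi, Damien Ernst, Klemens Böhm, Gaspard Lambrechts
AI summary: Proposes the informed asymmetric actor-critic framework, which lets the critic condition on arbitrary state-dependent privileged signals and is shown to yield unbiased policy gradient estimates; two informativeness criteria are designed for selecting signals, and experiments show that well-chosen signals can match or outperform full-state baselines.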
非对称强化学习利用训练时可用的特权信息来改善部分可观测条件下的学习。现有的非对称actor-critic方法通常假设在训练期间可以访问完整环境状态以条件化评论家,这在实践中往往不现实。我们引入了知情非对称actor-critic框架,允许评论家基于任意状态相关的特权信号进行条件化,并证明任何此类信号都会产生无偏的策略梯度估计。这大大扩展了可允许的特权信息集,并提出了选择最具信息性信号以促进学习的问题。为此,我们提出了两种新颖的信息性准则:一种基于依赖性的测试,可在训练前应用;另一种基于价值预测改进的测试,可事后应用。在部分可观测基准和合成环境上的实验表明,精心选择的特权信号可以在依赖更少状态信息的同时,匹配或超越全状态非对称基线。
Asymmetric reinforcement learning leverages privileged information available during training to improve learning under partial observability. Existing asymmetric actor-critic methods typically assume access to the full environment state to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework that allows the critic to be conditioned on arbitrary state-dependent privileged signals, and show that any such signal yields unbiased policy gradient estimates. This substantially expands the set of admissible privileged information and raises the problem of selecting the most informative signals for learning. To this end, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a test based on improvements in value prediction that can be applied post hoc. Experiments on partially observable benchmarks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
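Deeper or Wider: A Perspective from Optimal Generalization Error with Sobolev Loss
Yahong Yang, Juncai He
AI summary: Compares deeper and wider networks by analyzing the optimal generalization error under Sobolev losses, revealing how the number of sample points, the number of network parameters, and the regularity of the loss function affect the choice of architecture, with applications to the deep Ritz and PINN methods.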
构建神经网络的架构是机器学习社区的一个具有挑战性的追求,而更深还是更宽的困境仍然是一个持续存在的问题。本文探讨了具有灵活层数的深层神经网络(DeNNs)与具有有限隐藏层的宽神经网络(WeNNs)之间的比较,重点关注它们在Sobolev损失下的最优泛化误差。分析研究表明,神经网络的架构可能受到多种因素的显著影响,包括样本点数量、神经网络内的参数以及损失函数的正则性。具体来说,更多的参数倾向于有利于WeNNs,而增加的样本点数量和损失函数的更大正则性则倾向于采用DeNNs。我们最终将该理论应用于使用深度Ritz和物理信息神经网络(PINN)方法求解偏微分方程,指导神经网络的设计。
Constructing the architecture of a neural network is a challenging pursuit for the machine learning community, and the dilemma of whether to go deeper or wider remains a persistent question. This paper explores a comparison between deeper neural networks (DeNNs) with a flexible number of layers and wider neural networks (WeNNs) with limited hidden layers, focusing on their optimal generalization error in Sobolev losses. Analytical investigations reveal that the architecture of a neural network can be significantly influenced by various factors, including the number of sample points, parameters within the neural networks, and the regularity of the loss function. Specifically, a higher number of parameters tends to favor WeNNs, while an increased number of sample points and greater regularity in the loss function lean towards the adoption of DeNNs. We ultimately apply this theory to address partial differential equations using deep Ritz and physics-informed neural network (PINN) methods, guiding the design of neural networks.
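Mixtures of Neural Operators Reduce Active Complexity in Operator Learning
Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop
AI summary: By comparing routed mixtures of neural operators (MoNOs) with a fixed single-neural-operator construction, shows that MoNOs achieve better depth, width, and rank scaling in active expert size, with these quantities bounded by $\mathcal{O}(\varepsilon^{-1})$ for Lipschitz targets.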
算子学习系统并非仅由总参数数量决定;对于一次查询,相关瓶颈可能是必须加载和评估的模型。我们通过路由混合神经算子(MoNO)与固定单神经算子构造之间的建设性比较,在紧致Sobolev子集上研究了经典神经算子的这一区别。该比较涉及相对于基线的专家主动复杂度,其中总存储大小和路由搜索分别考虑。MoNO将每个输入函数通过树路由到一个专家。我们的主要定理表明,在近似集上,每个具有有界输出Sobolev半径的标量一致连续非线性算子都存在一个MoNO近似,其主动专家具有比所分析的单神经算子构造更小的深度、宽度和秩缩放;对于Lipschitz目标,这些专家量以$\mathcal{O}(\varepsilon^{-1})$为界。该定理将局部化转化为主动专家大小、路由深度和专家数量的算子级核算。我们还证明了底层神经算子架构的定量通用近似定理,明确依赖于紧集直径和连续模。
Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that must be loaded and evaluated. We study this distinction for classical neural operators on compact Sobolev subsets through a constructive comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction. The comparison concerns expert-active complexity relative to that baseline, with total stored size and routing search accounted separately. A MoNO routes each input function through a tree to one expert. Our main theorem shows that every scalar uniformly continuous nonlinear operator with bounded output Sobolev radius on the approximation set admits a MoNO approximation whose active expert has smaller depth, width, and rank scaling than the analyzed single-neural-operator construction; for Lipschitz targets these expert quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$. The theorem turns localization into an operator-level accounting of active expert size, routing depth, and number of experts. We also prove a quantitative universal approximation theorem for the underlying neural-operator architecture, with explicit dependence on compact-set diameter and modulus of continuity.
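Infinity-Norm-Based Input-to-State-Stable Long Short-Term Memory Networks: A Thermal Systems Perspective
Stefano De Carli, Davide Previtali, Leandro Pitturelli, Mirko Mazzoleni, Antonio Ferramosca, Fabio Previdi
AI summary: Proposes an input-to-state stability condition based on the infinity norm for LSTM networks and promotes stability during training through a penalty term and an early-stopping strategy; on a thermal system, the resulting model outperforms a physics-based model and GRU networks.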
递归神经网络(RNNs)在系统辨识中表现出色,尤其在非线性动力学系统如热过程方面。然而,稳定性在实际应用中仍是一个关键挑战:尽管底层过程可能本质上是稳定的,但所得RNN模型可能无法保证捕捉这种行为。本文通过推导基于无穷范数(ISS∞)的输入到状态稳定性条件,解决了稳定性问题。所获得的条件依赖于比先前工作更少的网络参数。开发了ISS∞促进的训练策略,将惩罚项纳入损失函数以促进稳定性,并采用一种自定义的早停方法。通过热系统案例研究验证了训练后的LSTM模型质量,其中ISS∞促进的LSTM网络在性能上优于物理模型和ISS∞促进的门控循环单元(GRU)网络,同时优于非ISS∞促进的LSTM和GRU RNN。
Recurrent Neural Networks (RNNs) have shown remarkable performance in system identification, particularly for nonlinear dynamical systems such as thermal processes. However, stability remains a critical challenge in practical applications: although the underlying process may be intrinsically stable, there may be no guarantee that the resulting RNN model captures this behavior. This paper addresses the stability issue by deriving a sufficient condition for Input-to-State Stability based on the infinity-norm (ISS$_{\infty}$) for Long Short-Term Memory (LSTM) networks. The obtained condition depends on fewer network parameters than those in prior works. An ISS$_{\infty}$-promoted training strategy is developed, incorporating a penalty term in the loss function that encourages stability and an ad hoc early stopping approach. The quality of LSTM models trained via the proposed approach is validated on a thermal system case study, where the ISS$_{\infty}$-promoted LSTM outperforms both a physics-based model and an ISS$_{\infty}$-promoted Gated Recurrent Unit (GRU) network while also surpassing non-ISS$_{\infty}$-promoted LSTM and GRU RNNs.
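MAD: Manifold Attracted Diffusion
Dennis Elbrächter, Giovanni S. Alberti, Matteo Santacesaria
AI summary: Proposes manifold attracted diffusion, which uses the manifold hypothesis and an extended score function to remove noise at inference time and generate noiseless samples, with effectiveness shown on toy problems, synthetic data, and real data.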
基于得分的扩散模型是从图像分布中生成样本的一种高效方法。我们考虑训练数据来自目标分布的有噪声版本的情况,并提出一种可高效实现的推理过程修改,以生成无噪声样本。我们的方法受流形假设启发,该假设认为有意义的数据集中在高维环境空间的某个低维流形周围。核心思想是,噪声表现为离流形方向上的低幅度变化,而目标分布的相关变化主要限于流形方向。我们引入了扩展得分概念,并表明在简化设置中,它可以将小变化减少为零,同时基本保持大变化不变。我们描述了如何从标准得分的近似中高效计算其近似,并在玩具问题、合成数据和真实数据上展示了其有效性。
Score-based diffusion models are a highly effective method for generating samples from a distribution of images. We consider scenarios where the training data comes from a noisy version of the target distribution, and present an efficiently implementable modification of the inference procedure to generate noiseless samples. Our approach is motivated by the manifold hypothesis, according to which meaningful data is concentrated around some low-dimensional manifold of a high-dimensional ambient space. The central idea is that noise manifests as low magnitude variation in off-manifold directions in contrast to the relevant variation of the desired distribution which is mostly confined to on-manifold directions. We introduce the notion of an extended score and show that, in a simplified setting, it can be used to reduce small variations to zero, while leaving large variations mostly unchanged. We describe how its approximation can be computed efficiently from an approximation to the standard score and demonstrate its efficacy on toy problems, synthetic data, and real data.
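Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization
Jingfeng Wu, Peter L. Bartlett, Sham M. Kakade, Jason D. Lee, Bin Yu
AI summary: Provides instance-wise comparisons of the finite-sample risks of gradient descent, ridge regression, and stochastic gradient descent in linear regression, finding that gradient descent dominates ridge regression but is incomparable with stochastic gradient descent, and can be worse on some problems.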
现有理论表明,对于按容量和源条件分类的线性回归问题,梯度下降(GD)始终是极小化最优的,而岭回归和在线随机梯度下降(SGD)对于某些类别的问题则是多项式次优的。超越极小化理论,本文为任何良好设定的线性回归问题提供了这些算法有限样本风险的实例比较。我们的分析得出三个关键发现。首先,GD 优于岭回归:在可比较的正则化下,GD 的过剩风险始终在岭回归的一个常数因子内,但即使经过最优调整,岭回归也可能多项式地更差。其次,GD 与 SGD 不可比。虽然已知对于某些问题 GD 可以多项式地优于 SGD,但反之亦然:我们受良性过拟合理论启发构造了问题,其中最优停止的 GD 多项式地更差。最后,对于一类重要子问题——具有快速且连续衰减协方差谱的问题,GD 优于 SGD,这包括所有满足标准容量条件的问题。
Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of that of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.
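A textbook way to see why early-stopped GD and ridge act as comparable regularizers is the spectral-filter view: along an eigendirection of the covariance with eigenvalue $\lambda$, GD from zero initialization with step size $\eta$ recovers a fraction $1-(1-\eta\lambda)^t$ of the least-squares solution after $t$ steps, while ridge with penalty $\alpha$ recovers $\lambda/(\lambda+\alpha)$. The sketch below is only this illustration, not the paper's analysis, and the matching $\alpha = 1/(\eta t)$ is a heuristic assumption.

```python
def gd_filter(lam, step, t):
    """Fraction of the least-squares solution recovered by t GD steps from zero."""
    return 1.0 - (1.0 - step * lam) ** t

def ridge_filter(lam, alpha):
    """Fraction of the least-squares solution kept by ridge with penalty alpha."""
    return lam / (lam + alpha)

step, t = 0.1, 50
alpha = 1.0 / (step * t)  # heuristic matching of the two regularization strengths
spectrum = [1.0, 0.1, 0.01]
table = [(lam, gd_filter(lam, step, t), ridge_filter(lam, alpha)) for lam in spectrum]
```

Both filters shrink small-eigenvalue directions toward zero, but the GD filter approaches one exponentially fast in $\eta\lambda t$ whereas the ridge filter does so only polynomially, so ridge keeps shrinking well-estimated directions.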
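Post-Training Augmentation Invariance
Keenan Eikenberry, Lizuo Liu, Yoonsang Lee
AI summary: Proposes a framework for post-training augmentation invariance in which lightweight MLP adapter networks acting on the latent space of a pretrained model achieve approximate invariance without fine-tuning while preserving the original features.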
本文开发了一个训练后增强不变性的框架,其目标是为预训练网络添加不变性属性,同时不改变其在原始非增强输入分布上的行为。我们精确定义了这一概念,并引入了增强编码器,这是一种概率编码器,形式化了基于增强的编码过程,并作为我们的基本研究对象。我们提出了两种增强编码器的损失函数,即马尔可夫-瓦瑟斯坦最小化和瓦瑟斯坦相关性最大化,并通过实验证明,这两种损失函数可用于训练轻量级的单隐藏层MLP适配器网络$E_{\theta}$,当将其附加到预训练网络$F$的潜空间时,确实能实现(近似)训练后增强不变性。例如,在STL10上使用$F=\text{DINO}$特征时,复合网络$C\circ E_{\theta}\circ F$(其中$C$是线性分类器,$E_{\theta}$是我们提出的适配器网络之一)在任意旋转图像上达到94%的分类准确率,而没有适配器$E_{\theta}$的$C\circ F$网络则降至71%。类似地,我们可以将噪声不变分类结果从58%提升至86%。重要的是,我们无需微调即可获得这些结果($F$的权重全程冻结),并且我们的方法对原始特征的破坏很小,因为$E_{\theta}$在非增强潜分布上几乎等距作用。相比之下,我们展示了使用替代候选损失函数(特别是SimCLR和HSIC最大化)训练的适配器网络产生了不具竞争力的分类结果,并从根本上破坏了原始潜空间。代码见https://this URL。
This work develops a framework for post-training augmentation invariance, in which our goal is to add invariance properties to a pretrained network without altering its behavior on the original, non-augmented input distribution. We define this notion precisely and additionally introduce augmented encoders, which are probabilistic encoders that formalize augmentation-based encoding processes and that serve as our fundamental object of study. We introduce two losses for augmented encoders, namely, Markov-Wasserstein minimization and Wasserstein correlation maximization, and we demonstrate empirically that both losses can be used to train lightweight, one-hidden-layer MLP adapter networks $E_{\theta}$ that, when appended to the latent space of a pretrained network $F$, do indeed lead to (approximate) post-training augmentation invariance. For example, on STL10 with $F=\text{DINO}$ features, the composite network $C\circ E_{\theta}\circ F$, where $C$ is a linear classifier and where $E_{\theta}$ is one of our proposed adapter networks, achieves 94% classification accuracy on arbitrarily rotated images, whereas a network of the form $C\circ F$ without the adapter $E_{\theta}$ drops to 71% accuracy. Similarly, we can boost noise-invariant classification results from 58% up to 86%. Significantly, we obtain these results with no fine-tuning (the weights of $F$ remain frozen throughout), and our methods introduce little corruption to the original features, since $E_{\theta}$ acts nearly isometrically on the non-augmented latent distribution. In contrast, we show that adapter networks trained with alternative candidate losses, specifically SimCLR and HSIC maximization, produce uncompetitive classification results and fundamentally corrupt the original latent space. Code available at https://github.com/keenan-eikenberry/augmentation_invariance
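An $(\epsilon,\delta)$-Accurate Level Set Estimation with a Stopping Criterion
Hideaki Ishibashi, Kota Matsui, Kentaro Kutsukake, Hideitsu Hino
AI summary: Proposes an acquisition strategy for level set estimation with a stopping criterion, proves that it achieves $\epsilon$-accuracy at confidence level $1-\delta$, reduces unnecessary function evaluations, and validates its effectiveness experimentally.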
水平集估计问题旨在识别候选点集内未知且评估代价高昂的函数值超过指定阈值的区域,为全面评估函数值提供了一种高效替代方案。传统方法通常采用序列优化策略来寻找 $\epsilon$-精确解,该解允许在阈值轮廓周围留有余量,但往往缺乏有效的停止准则,导致过度探索和效率低下。本文引入了一种带有停止准则的水平集估计获取策略,确保算法在进一步探索不太可能带来改进时停止,从而减少不必要的函数评估。我们从理论上证明,该方法在 $1-\delta$ 的置信水平下满足 $\epsilon$-精确度,弥补了现有方法的一个关键空白。此外,我们表明这还带来了对 F-score 等性能指标下限的保证。数值实验表明,所提出的获取函数在达到与现有方法相当的精确度的同时,确认了停止准则在充分探索后有效终止算法。
The level set estimation problem seeks to identify regions within a set of candidate points where the value of an unknown, costly-to-evaluate function exceeds a specified threshold, providing an efficient alternative to exhaustive evaluation of function values. Traditional methods often use sequential optimization strategies to find $\epsilon$-accurate solutions, which permit a margin around the threshold contour, but they frequently lack effective stopping criteria, leading to excessive exploration and inefficiency. This paper introduces an acquisition strategy for level set estimation that incorporates a stopping criterion, ensuring the algorithm halts when further exploration is unlikely to yield improvements, thereby reducing unnecessary function evaluations. We theoretically prove that our method satisfies $\epsilon$-accuracy with a confidence level of $1 - \delta$, addressing a key gap in existing approaches. Furthermore, we show that this also leads to guarantees on the lower bounds of performance metrics such as F-score. Numerical experiments demonstrate that the proposed acquisition function achieves comparable precision to existing methods while confirming that the stopping criterion effectively terminates the algorithm once adequate exploration is completed.
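Solving the Separation Problem: Firth-Corrected Joint Models for Longitudinal and Time-to-Event Data with an Application to Dropout from Vocational Training
Sophie Potts, Viola Deutscher, Elisabeth Bergherr
AI summary: To address estimation bias caused by separation in categorical covariates in joint models, introduces Firth's correction into maximum likelihood estimation, implemented via the EM algorithm; simulations and real data show reduced bias, and the method is applied to analyze factors influencing dropout from vocational training in Germany.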
纵向与时间-事件数据的联合模型常用于建模内源性纵向协变量与时间-事件结局的关系。然而,该类模型继承了生存子模型的一些局限性,包括分类协变量每个类别必须非分离。因此,我们将Firth校正引入联合模型的频率学派估计过程,使模型类适用于存在分离情况的数据集。我们推导了校正项所需的量,并在联合模型的参数估计中将其实现于期望最大化算法。我们的模拟研究表明,在存在分离问题的数据情境下,Firth校正估计过程产生更少偏差的估计,且相应系数趋近于非分离情况下观察到的估计值。在关于职业培训满意度和辍学数据集上的应用展示了Firth校正联合模型在真实世界分离数据集中的优势。结果通过明确建模社会经济和培训特定因素对辍学风险的直接效应以及它们通过培训满意度的间接贡献,补充了德国职业培训辍学研究的文献。
Joint models for longitudinal and time-to-event data are frequently used to model endogenous longitudinal covariates alongside a time-to-event outcome. However, the model class inherits some limitations of the survival submodels, including the requirement of non-separation for each category of categorical covariates. We therefore incorporate Firth's correction into the frequentist estimation procedure of joint models in order to make the model class applicable in settings with separation. We derive the quantities needed for the correction term and implement it in the Expectation-Maximization algorithm for parameter estimation in joint models. Our simulation study shows that, in data situations with separation issues, the Firth-corrected estimation procedure yields less biased estimates and the respective coefficients approach the estimated values observed in the non-separation cases. An application to a data set on satisfaction with and dropout from vocational training demonstrates the advantages of the Firth-corrected joint model in a real-world data set with separation. The results add to the literature on dropout from vocational training in Germany by explicitly modeling direct effects of socioeconomic and training-specific factors on the risk of dropout as well as their indirect contribution via satisfaction with the training.
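Two-Stage Imputation and Cross-Reference Harmonization of Longitudinal Anthropometric Data: A Simulation Study
Flavia Alves
AI summary: Proposes a two-stage method that fills in missing anthropometric values in longitudinal data through linear interpolation followed by growth-reference imputation based on the LMS method, with explicit handling of differing reference standards; simulations show small errors and negligible bias.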
Objective. Longitudinal datasets frequently contain missing weight and height measurements, and studies that combine data sources may index measurements against different growth reference standards (e.g., the WHO reference and CDC charts). We describe and evaluate a reproducible two-stage method that imputes missing anthropometry while making the choice of reference standard an explicit parameter. Methods. Stage 1 applies within-subject linear interpolation across visit dates (interior gaps only, no extrapolation). Stage 2 imputes remaining values from an age- and sex-specific growth reference using the LMS method by estimating each subject's centile, carrying it forward and backwards within the subject, defaulting to the 50th centile when a subject is never measured, and reading the expected value off the reference at the visit age. Different references can be supplied per data source so that the standard applied is recorded and auditable. We assessed recovery accuracy by masking and re-imputing a random 20% of observed values. All evaluations used computer-generated synthetic data. Results. On synthetic data (n = 60 subjects, 288 visits, 30% missing), the method resolved missingness to 100% completeness. Masked-value recovery gave a mean absolute error of 1.78 kg for weight (3.5% mean absolute percentage error) and 2.84 cm for height (2.0%), with negligible bias. Values recovered by within-subject interpolation were more accurate than those recovered from the growth reference, as expected, supporting the two-stage ordering. Conclusion. The method offers a simple, dependency-free, and auditable approach to anthropometric imputation, with explicit handling of differing reference standards and per-value provenance. Application to empirical data and propagation of imputation uncertainty into downstream models are the necessary next steps before use in substantive analyses.
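The Stage 2 step described above (estimate a subject's centile, carry it to another visit, read the expected value off the reference) can be sketched with the standard LMS transform. This is a minimal illustration, not the author's implementation: the function names and the LMS parameter triples `(L, M, S)` used in the usage note are hypothetical, and carrying the z-score is used here as an equivalent of carrying the centile.

```python
import math

def lms_zscore(x, L, M, S):
    """Z-score of measurement x under LMS parameters (Box-Cox power L, median M, CV S)."""
    if abs(L) < 1e-12:
        return math.log(x / M) / S
    return ((x / M) ** L - 1.0) / (L * S)

def lms_value(z, L, M, S):
    """Inverse transform: the measurement that sits at z-score z on the reference."""
    if abs(L) < 1e-12:
        return M * math.exp(S * z)
    return M * (1.0 + L * S * z) ** (1.0 / L)

def impute_from_reference(observed, obs_params, target_params):
    """Carry a subject's z-score from an observed visit to a visit with a missing value.

    observed: measurement at the observed visit, or None if the subject was
              never measured (falls back to z = 0, i.e. the 50th centile).
    obs_params / target_params: (L, M, S) at the observed and target visit ages.
    """
    z = 0.0 if observed is None else lms_zscore(observed, *obs_params)
    return lms_value(z, *target_params)
```

For example, with made-up parameters `(-0.2, 12.0, 0.11)` at the observed age and `(-0.3, 14.0, 0.12)` at the target age, a child measured above the median at the first visit is imputed above the median at the second, and a never-measured child is imputed at the target median `M`.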
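Predicting Hospitalization from Whole-Person Health Scores in Incomplete Electronic Health Record Data: A Case Study
Grayson E. Weavil, Joseph Rigdon, Sarah C. Lotspeich
AI summary: This study uses statistical modeling and machine learning to compute the allostatic load index (ALI) from incomplete electronic health records and evaluates its ability to predict hospitalization, finding that the pattern submodel approach performed best in sample (AUC = 0.73) but cross-validated less well (AUC = 0.63).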
Embedding a standardized whole-person health measure in electronic health records (EHR) could be instrumental to preventative care. The allostatic load index (ALI), calculated from ten component stressors across three body systems, offers a promising snapshot of holistic health. The ALI can be calculated from EHR data, but many components are missing, since not all patients undergo all tests. Using statistical modeling and machine learning, EHR data for $1000$ patients from a large academic health system were used to predict in-patient hospitalization (as a count or binary) from ALI, controlling for age and sex. Various methods were evaluated to fill in information gaps for patients' missing ALI components, including summary measures combining components or using them separately. Performance was measured using receiver operating characteristic (ROC) curves and corresponding areas under the ROC curve (AUC). Count modeling of hospitalization did not improve upon binary, and logistic regression beat random forest. Overall, summary measures performed similarly, with the complete-case proportion (i.e., the proportion of non-missing components that were "unhealthy") performing best (AUC $= 0.64$) but by $\leq 0.01$. When using components separately, the pattern submodel approach most accurately predicted hospitalization (AUC $= 0.73$) in sample, but did not cross-validate as well (AUC $= 0.63$). All summary measures performed similarly. However, when including the ALI components separately, tailoring models to subsets of patients with the same missing data pattern performed best. Next steps include EHR implementation to enable prediction and support clinician decision-making at scale.
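OncoTraj: A Public Benchmark for Longitudinal Prediction of Osimertinib Resistance in EGFR-Mutant Non-Small-Cell Lung Cancer
Abhijoy Sarkar, Aarchi Singh Thakur
Affiliation: Span AI
AI summary: To address the lack of a public benchmark for predicting resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer, the OncoTraj benchmark harmonizes data from 813 patients and defines three tasks; it finds that single-timepoint tissue NGS features leave every model near chance, while TP53 co-mutation is associated with a higher progression rate.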
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
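Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease with Gradient Boosting and Distribution-Free Coverage
Xinze Zhang
AI summary: Proposes Method, a machine-learning framework combining gradient-boosted decision trees with conformal prediction to deliver distribution-free, calibrated coverage for individual NAFLD risk predictions; on a multicenter Chinese cohort it reaches an AUROC of 0.912 and outperforms several alternative methods.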
Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features -- notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI -- align with established metabolic risk factors, providing biological plausibility.
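To make the coverage guarantee concrete, the following is a minimal sketch of standard split conformal prediction for a classifier. It assumes class probabilities are already available from some fitted model (for instance gradient-boosted trees) and uses the common nonconformity score 1 - p(true class); the abstract does not state which score or calibration scheme the authors use, so the function names and choices here are illustrative only.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Conformal quantile of the scores 1 - p(true class) on a held-out calibration set."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # rank with finite-sample correction
    if k > n:
        # Too few calibration points for this alpha: scores never exceed 1,
        # so a threshold of 1.0 puts every class in every prediction set.
        return 1.0
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All classes whose score 1 - p_k does not exceed the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
```

Under exchangeability of calibration and test points, sets built this way contain the true class with probability at least 1 - alpha on average over the population (marginal coverage). A set can be empty or contain several classes, which is how the method expresses uncertainty for an individual.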
Collaborative Estimation and Evaluation of Real-Time SARS-CoV-2 Variant Nowcasts in the United States
Isaac MacArthur, Thomas Robacker, Bren Case, Spencer J. Fox, Dylan H. Morris, Evan L. Ray, Benjamin Rogers, Becky Sweger, Natalie M. Linton, John Huddleston, Andrew Magee, Zachary Susswein, Jover Lee, Trevor Bedford, Marlin D. Figgins, Ehsan Suez, Rajath Prabhakar, Tomas Leon, Brent Siegel, Mugdha Thakur, Christopher M. Hoover, Rahil Ryder, Jesse Elder, Michael Kupperman, Ruian Ke, Emma Goldberg, Sebastian Funk, Maryclare Griffin, Nicholas G. Reich, Kaitlyn E. Johnson
AI summary: Describes the construction of the United States SARS-CoV-2 Variant Nowcast Hub and evaluates five models and a baseline model over the 2024-2025 respiratory virus season, finding that the baseline performs well overall and that model performance varies more in locations with low sequencing volumes.
The ability to estimate and predict pathogen variant dynamics can inform public health responses, including planning for increased transmission or severity, shifts in population immunity, or changes to vaccine or therapeutic effectiveness. The COVID-19 pandemic demonstrated the importance of monitoring SARS-CoV-2 variant evolution through viral genome sequencing, enabling predictive models to estimate variant frequencies in the recent past, present, and short-term future. Collaborative forecasting Hubs provided a valuable way to centralize predictive modeling of epidemiological indicators such as cases, hospitalizations, and deaths during the pandemic; however, none existed for variant dynamics. Here, we discuss the creation of the United States SARS-CoV-2 Variant Nowcast Hub, designed to solicit estimates of the relative abundance of a specified set of SARS-CoV-2 variants at the U.S. state level. We discuss the design decisions and challenges in building the Hub and its scoring procedures. Using submissions from the Hub's first respiratory virus season (nowcast dates October 9th, 2024 to June 4th, 2025), we evaluate five individual models and a baseline model. We found that the baseline model, which pools sequences across the U.S., performs well overall, with most individual models performing similarly or slightly worse. Locations with lower sequencing volumes exhibited greater variability in model performance. Models submitted for a single location outperformed those submitted for all locations, potentially due to greater timeliness and magnitude of local data. Much remains to be investigated regarding relative model performance across different phases of variant emergence, and we conclude by proposing future directions within and beyond this Hub.
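Multi-Pathogen Situational Assessment and Forecasting of Respiratory Disease in New Zealand
M. J. Plank, A. R. Young, K. L. Senior, R. J. Tobin, M. O'Hara-Wild, F. Callaghan, F. Shearer, O. Eales
AI summary: For three pathogens (SARS-CoV-2, influenza, and RSV), models built on real-time surveillance data assess epidemic trends and produce 28-day forecasts to support public health planning.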
Real-time analysis of epidemic trends and forecasts can help support public health planning and the response to seasonal respiratory disease. Here, we present two models that were used in a 2025 New Zealand winter situational assessment programme for three respiratory pathogens: SARS-CoV-2, influenza and respiratory syncytial virus (RSV). Data on SARS-CoV-2 were obtained from the national Covid-19 surveillance system; data on influenza and RSV were limited to a sentinel hospital surveillance programme. Models were run weekly from May to October 2025 on these real-time disease surveillance data and provided a quantitative representation of the current epidemic trend, along with estimates of the epidemic growth rate and 28-day ahead forecasts of case incidence. Model results and interpretation were provided in weekly reports to public health partners as part of a trans-Tasman winter programme run by the Australia--Aotearoa Consortium for Epidemic Forecasting and Analytics (ACEFA). We compare in-season results that were included in these reports to a retrospective analysis of the complete data for the season. We conclude that real-time analyses performed reasonably well, and identify some areas for improvement in future winter situational assessment programmes.
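An Empirical Comparison of the Win Ratio and Joint Frailty Models for Recurrent-Event Endpoints, with Applications in Oncology and Cardiology
Adrien Orué, Derek Dinart, Laurent Billot, Carine Bellera, Virginie Rondeau
AI summary: Compares the joint frailty model (JFM) with the last-event assisted recurrent-event win ratio (LWR) for the analysis of composite endpoints, finding that the JFM offers better statistical power and more reliable inference, while the LWR provides a directional summary measure.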
Composite endpoints that combine recurrent non-fatal events with a terminal event are increasingly used in randomized clinical trials, yet conventional time-to-first event analyses may obscure clinically relevant information. We compared two statistical frameworks tailored to such endpoints: the joint frailty model (JFM) and the last-event assisted recurrent-event win ratio (LWR). The JFM specifies proportional hazards for the recurrent and terminal events linked through a shared frailty, yielding covariate-adjusted, component-specific hazard ratios that account for informative recurrences and dependence with death. The LWR is a nonparametric, prioritized pairwise comparison that incorporates all observed events over follow-up and summarizes a population-level benefit of treatment while respecting a pre-specified hierarchy between death and recurrences. We first assessed the performance of the methods using simulations that varied both the gamma-frailty variance and the event rates. We next illustrated both approaches using two clinical application examples in oncology and cardiology, highlighting how conclusions depend on whether treatment primarily affects recurrent events, mortality, or both. The JFM provided component-specific estimates, while the LWR led to a summary measure of treatment effect with direction. Power was systematically improved with JFM, which thus appeared as the most reliable approach for inference and sample size estimation. Methodological extensions of the LWR to appropriately handle censoring and to formalize causal estimands remain a promising direction for future research.
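Measuring Frailty in Older Adults: An Indicator Based on a Super-Classifier
Sara Rebottini, Margherita Silan, Pietro Belloni
AI summary: Proposes a composite indicator based on administrative healthcare data that quantifies frailty in older adults through the combined likelihood of multi-outcome logistic classifiers, allowing flexible, outcome-specific sets of frailty determinants.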
Identifying frail older adults in an ageing population is essential for improving healthcare services. This study proposes a composite indicator to assess individual frailty levels using administrative healthcare data. Given the complex and multidimensional nature of frailty, a multi-outcome approach is adopted. Following an extensive literature review, a set of adverse health events is selected as proxies for frailty. These events are modelled using logistic classifiers, with frailty determinants (associated with adverse health events and selected using gradient tree boosting) serving as covariates. The sensitivity and specificity of each classifier are used to compose their combined likelihood. From this, we derive an indicator capable of quantifying frailty across the population. The indicator shows robust performance across multiple outcomes and over time. Its primary innovation lies in allowing the use of diverse and outcome-specific sets of frailty determinants without any structural constraint. Overall, we offer an effective tool for quantifying frailty among older adults, potentially supporting health authorities in the prevention of frailty-related adverse events.
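COVID-19 Booster Vaccine Policy Development Based on Microsimulation and Q-Learning
Guoxuan Ma, Sicong Xie, Lili Zhao, Jian Kang
AI summary: Proposes a framework combining tabular Q-learning with microsimulation, using an RNN digital-twin environment to learn vaccine policies safely; the learned COVID-19 booster policy outperforms current practice.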
The COVID-19 pandemic highlighted the urgent need for effective vaccine policies, but traditional clinical trials often lack sufficient data to capture the diverse population characteristics necessary for comprehensive public health strategies. Ethical concerns around randomized trials during a pandemic further complicate policy development for public health. Reinforcement Learning (RL) offers a promising alternative for vaccine policy development. However, direct online RL exploration in real-world scenarios can result in suboptimal and potentially harmful decisions. This study proposes a novel framework combining tabular Q-learning with microsimulation, where a Recurrent Neural Network (RNN) serves as a digital twin environment simulator of the target population. This digital twin captures temporal associations between infection and patient characteristics to generate realistic individual disease trajectories, enabling safe and efficient policy learning without real-world interaction. Our tabular Q-learning model produces an interpretable policy table that balances the risks of severe infection against vaccination side effects. Applied to COVID-19 booster policies, the learned Q-learning-based policy outperforms current practices, offering a path toward more effective vaccination strategies. A project webpage introducing our work, including links to the software, a brief introductory video, and a step-by-step tutorial video, is available at https://public.websites.umich.edu/~jiankang/software/dtpl_website_umich/index.html.
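For readers unfamiliar with the tabular component, the textbook Q-learning update and the extraction of a policy table can be sketched as follows. This is the generic algorithm, not the authors' code: the state and action labels in the usage note are hypothetical, and in the paper's setting the transitions would come from the RNN-based simulator rather than from real-world interaction.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the table entry Q[(state, action)]."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

def greedy_policy(Q, states, actions):
    """Readable policy table: the highest-valued action for each state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Unseen (state, action) pairs start at 0.
Q = defaultdict(float)
```

With hypothetical actions `["boost", "wait"]`, a single rewarded transition such as `q_update(Q, "s0", "boost", 1.0, "s1", ["boost", "wait"])` moves that entry from 0 to 0.1, and `greedy_policy` then maps `"s0"` to `"boost"`. The resulting dictionary is the kind of table a reader can inspect directly.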
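Structural Under-Representation of Women in the News: A Nonparametric Bayesian Mixture Model Capturing Time-Dependent Dynamics
Isabella Habereder, Thomas Kneib, Isao Echizen, Timo Spinde
AI summary: Analyzes Canadian news data with a time-dependent Bayesian mixture model, revealing structural under-representation of women in quote shares across all topics and regions, with more than 85% of time series showing no improvement.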
The under-representation of women as sources cited in news media is one prominent representation of gender bias. Understanding where gender bias concentrates and how it evolves is essential for targeted mitigation. Because gender representation varies across topics, time, and reported-on regions, creating complex dependencies that are difficult to capture parametrically, we employ a nonparametric model to uncover latent cluster structures and temporal dynamics. We combine time-dependent Bayesian mixture modeling techniques with a Beta mixture kernel tailored to female quote shares, bounded between 0 and 1. Fitted on Canadian news articles from 2019 to 2024, the model reveals structural under-representation of women across all clusters, with news topic driving differences in female quote shares more strongly than the reported-on region. More than 85% of topic-region time series show no improvement toward gender parity over the observation period. Dynamic density estimation confirms that the aggregate distribution of female quote shares remains stable between 2019 and 2024. Our application demonstrates that advanced probabilistic models not only reproduce findings in gender bias research but also reveal latent dependencies and structural patterns that simpler approaches miss, encouraging future adoption of model-based frameworks for studying media bias.
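Binomial Smoothing for Inventory and Information Control in Supply Chains
Rene Caldentey, Avi Giloni, Clifford Hurvich, Prem Talwai, Yichen Zhang
AI summary: To address the trade-off between a retailer's order smoothing and upstream forecasting in decentralized supply chains, proposes a Binomial Smoothing policy that minimizes the manufacturer's forecast error while remaining invertible and achieves a constant-factor approximation to the optimum.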
In many decentralized supply chains, upstream firms do not observe market demand directly and instead infer downstream conditions from the order stream. A retailer's replenishment policy therefore plays a dual role: it governs inventory replenishment and shapes the information available for upstream forecasting. This creates a fundamental trade-off. Smoother orders improve upstream predictability, but delaying the response to demand can increase downstream inventory costs. We study how a retailer should optimally smooth demand in a two-tier supply chain with one retailer and one manufacturer when the manufacturer forecasts future orders from the retailer's order history. We propose Binomial Smoothing, a class of replenishment policies that implements delayed demand response by spreading each unit of demand over a finite horizon using binomial weights. The class is interpretable, easy to calibrate, and analytically tractable. Under weakly stationary Gaussian demand satisfying mild regularity conditions, we show that, for any fixed smoothing horizon, the Binomial policy minimizes the manufacturer's forecast error among all policies with the same degree of smoothing. It remains invertible, so the manufacturer can recover demand history from observed orders. More generally, Binomial Smoothing achieves a constant-factor approximation guarantee relative to an optimal policy. Our results yield a broader insight: replenishment policies should be designed not merely to reduce order variance, as in the traditional bullwhip measure, but to reduce the unpredictable component of orders. Carefully designed smoothing can improve supply-chain performance and partially substitute for information sharing, providing a concrete mechanism for coordination without collaboration.
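The idea of spreading each unit of demand over a finite horizon with binomial weights can be illustrated with a short sketch. The parameterization below (a horizon `n` with weights given by the Binomial(n, p) probability mass function, and `p = 0.5` as the default) is an assumption made for illustration; the abstract does not specify the exact weights or indexing the authors use.

```python
from math import comb

def binomial_weights(horizon, p=0.5):
    """Weights w_0..w_horizon from the Binomial(horizon, p) pmf; they sum to 1."""
    return [comb(horizon, k) * p**k * (1 - p) ** (horizon - k)
            for k in range(horizon + 1)]

def smoothed_orders(demand, horizon, p=0.5):
    """Order stream in which demand seen at time t is placed over t, ..., t + horizon."""
    w = binomial_weights(horizon, p)
    orders = [0.0] * (len(demand) + horizon)
    for t, d in enumerate(demand):
        for k, wk in enumerate(w):
            orders[t + k] += wk * d
    return orders
```

With `horizon=2` the weights are 0.25, 0.5, 0.25, so a single demand spike of 4 units becomes orders of 1, 2, 1 over three periods. Total ordered quantity equals total demand, and `horizon=0` reproduces the demand stream unchanged, which corresponds to no smoothing.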
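The Power of Altruism in Sticker Economics: Generosity Minimizes Collective Cost, Overprotective Norms Cause Inefficiency
Luana Ferraz Alvarenga, Caetano Alvarenga Costa, César Rennó-Costa
AI summary: Uses agent-based modeling and Monte Carlo simulation to study how community norms affect the collective efficiency of collecting FIFA World Cup stickers, finding that overprotective strategies raise costs while generous strategies improve network liquidity and substantially reduce the impact of bad luck.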
Collecting the FIFA World Cup sticker album presents a classic public-goods and collective-action dilemma, in which completing a collection on one's own is highly inefficient. To evaluate how localized community norms shape collective efficiency, we use agent-based modeling and Monte Carlo simulations, parameterized with empirical field observations from exchange meetups in Natal, Brazil. Reflecting the tournament's recent expansion, the Panini 2026 album features 980 individual stickers, including 68 metallic specials. We contrast a standard baseline economy (1:2 special-to-normal exchange ratio) with an overprotective, strict strategy (exclusive special-for-special trading) and an altruistic, generous strategy (in which advanced players surrender needed duplicates to assist peers). Our findings reveal that overprotective rules trap liquidity and drive network-wide inefficiency. The strict strategy increases median completion costs by 10 packs and severely penalizes the least fortunate 5% of collectors, adding 20 packs in large cities and 30 in small communities. Conversely, widespread generosity optimizes network liquidity and dramatically compresses the long tail of bad luck. Introducing the generous strategy reduces required purchases for the 5th percentile by 90 packs in large-scale configurations and 130 packs in smaller clusters. Furthermore, widespread altruism triggers a strong functional coupling that effectively synchronizes completion rates across the network. This study demonstrates that while rigid, protective norms degrade collective welfare, generosity successfully mitigates pack-draw variance, transforming an expensive, isolated hobby into a resilient, highly efficient public good.
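The claim that solo completion is highly inefficient can be checked with a small Monte Carlo sketch of the no-trading case. This is not the authors' agent-based model: it assumes a hypothetical pack size of 7, uniformly random stickers (specials treated like any other sticker, duplicates possible within a pack), and no exchanges at all.

```python
import random

def packs_to_complete(n_stickers=980, pack_size=7, rng=None):
    """Number of packs a lone collector buys before owning every sticker."""
    rng = rng or random.Random()
    owned = set()
    packs = 0
    while len(owned) < n_stickers:
        owned.update(rng.randrange(n_stickers) for _ in range(pack_size))
        packs += 1
    return packs

def simulate(n_runs=200, seed=0, n_stickers=980, pack_size=7):
    """Median pack count and the pack count reached by the unluckiest 5% of runs."""
    rng = random.Random(seed)
    results = sorted(packs_to_complete(n_stickers, pack_size, rng)
                     for _ in range(n_runs))
    median = results[n_runs // 2]
    unlucky_5pct = results[int(0.95 * n_runs)]
    return median, unlucky_5pct
```

A perfectly efficient collector would need only 980 / 7 = 140 packs, so the gap between that figure and the simulated median gives a sense of how much room trading norms have to reduce cost. The upper-tail value is the quantity that the strict and generous strategies in the abstract are reported to shift.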
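The Impact of Insurance on Poor Households Facing Random Proportional Losses: A Poverty Trap Analysis
Kira Henshaw, Jorge Ramirez, José Miguel Flores-Contró, Enrique A. Thomann, Sooie-Hoe Loke, Corina Constantinescu
AI summary: Studies the effect of insurance on the trapping probability under a proportional-loss model, deriving new closed-form solutions for power-law distributions without insurance and a non-local differential equation for the uniform distribution with insurance, analyzing parameter constraints, and computing trapping probabilities numerically.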
The trapping probability, $ψ$, as defined in Kovacevic and Pflug (2011), is modelled by assuming proportional capital losses, both in the case where there is no insurance and in the case where insurance is purchased by the household. Insurance coverage is likewise proportional, mirroring the structure of quota-share contracts, which are both prevalent in practice and analytically convenient. New closed formulae for $ψ$ are obtained in the case of no insurance when the distribution of the remaining proportion of capital is a power law, extending the results in Kovacevic and Pflug (2011). When proportional insurance is acquired and the remaining proportion of capital is uniformly distributed on $[0,1]$, $ψ$ satisfies a non-local differential equation whose analysis is based on the properties of diffusion processes. The non-local nature of the equation can be addressed using iterative solution methods, leading to a constructive determination of the trapping probability. Constraints on the parameters governing the capital process are derived in both the uninsured and insured cases to prevent the certainty of trapping. Numerical calculations are used to determine the trapping probability for the insured process and to illustrate the impact of different parameters. Consequences on the trapping probability for vulnerable non-poor populations with initial capital slightly above the poverty line are discussed.
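Advancing the State of the Art in Empirical Privacy Auditing
Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz
Affiliation: Google Research
AI summary: Proposes generating synthetic canaries by high-temperature sampling for empirical privacy auditing, introduces a synthetic-data audit based on an auxiliary model, and systematically studies how model capacity and canary entropy interact to affect memorization.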
Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.
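Spiking the Training Data to Correct for Test Set Contamination
Johnny Tian-Zheng Wei, Jerry Li, Ameya Godbole, Robin Jia
AI summary: Proposes correcting the score inflation caused by test set contamination by deliberately contaminating some test examples at known rates (spiking) and using memorization predictors for statistical correction.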
The literature on test set contamination largely focuses on detection, but the correction of contaminated test scores is underexplored. Our core proposal is to spike the training data by intentionally contaminating some test examples at known rates. The spiked examples can then be used to calibrate predictors of model memorization which enable principled statistical correction of inflated test scores. To evaluate different correction estimators, we first present a simulation framework based on the Hubble models. Hubble models come in minimal pairs, where the perturbed model was deliberately contaminated with several test sets, while the standard model was not, serving as the counterfactual and correction target. We consider estimators that use information from a memorization predictor, correctness predictor, or both. In simulation, we establish basic statistical intuitions and show that estimators leveraging memorization and correctness information are better than naive estimation which makes no correction at all. We then instantiate several memorization and correctness predictors, and find that simple predictors such as Platt-scaled membership inference metrics provide good signal for correction. Finally, we examine the practical considerations of spiking. Simple memorization predictors need no more than 10 examples for calibration and often transfer from one dataset to another. Taken together, spiking is a promising solution for test set contamination.
Loss Function Symmetrization for Robust Training of Neural Networks in the Presence of Noisy Labels
Alexandre Lemire Paquin, Brahim Chaib-Draa, Philippe Giguère
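AI summary: Studies the design of robust loss functions by symmetrizing the cross-entropy loss, proposes a multi-class symmetric loss, and shows its effectiveness under noisy labels.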
Labeling a training set is often expensive and susceptible to errors, making the design of robust loss functions for label noise an important problem. The symmetry condition provides theoretical guarantees for robustness to such noise. In this work, we study a symmetrization method arising from the unique decomposition of any multi-class loss function into a symmetric component and a class-insensitive term. In particular, symmetrizing the cross-entropy loss leads to a linear multi-class extension of the unhinged loss. Unlike in the binary case, the multi-class version must have specific coefficients in order to satisfy the symmetry condition. Under suitable assumptions, we show that this multi-class unhinged loss is the unique convex multi-class symmetric loss. We also show that it has a fundamental local role: the linear approximation of any symmetric loss around score vectors with equal components is equivalent to the multi-class unhinged loss. We then introduce SGCE and alpha-MAE, two loss functions that interpolate between the multi-class unhinged loss and the Mean Absolute Error while allowing control of the beta-smoothness of the loss. Experiments on standard noisy-label benchmarks show competitive performance compared with existing robust loss functions.
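The symmetry condition mentioned above is usually stated in the noisy-label literature as: the loss summed over all candidate labels is a constant that does not depend on the prediction. The sketch below checks this numerically for the Mean Absolute Error and for cross-entropy. It is only an illustration of that condition under this assumed formulation; the helper names are hypothetical and it does not implement SGCE, alpha-MAE, or the multi-class unhinged loss from the paper.

```python
import numpy as np

def softmax(scores):
    """Map a score vector to a probability vector."""
    z = np.exp(scores - scores.max())
    return z / z.sum()

def mae_loss(probs, label):
    """L1 distance between the probability vector and the one-hot label."""
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return np.abs(probs - one_hot).sum()

def cross_entropy_loss(probs, label):
    """Negative log-probability of the given label."""
    return -np.log(probs[label])

def sum_over_labels(loss, probs):
    """Sum the loss over every possible label; constant for a symmetric loss."""
    return sum(loss(probs, k) for k in range(len(probs)))

rng = np.random.default_rng(0)
for _ in range(3):
    p = softmax(rng.normal(size=4))
    # MAE sums to 2 * (K - 1) for K classes; cross-entropy varies with p.
    print(sum_over_labels(mae_loss, p), sum_over_labels(cross_entropy_loss, p))
```

With K classes, the MAE sum equals 2(K - 1) for every prediction, whereas the cross-entropy sum changes with the prediction, which is why cross-entropy needs symmetrization.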
ClusBench: The Clustering Benchmark Data Resource You Have Been Waiting For(?)
David P. Hofmeyr
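AI summary: By fitting flexible non-parametric distributions, generates close to 3000 synthetic data sets from more than 200 public data sets for large-scale evaluation of clustering methods while retaining the nuance of real data.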
Although some very common test beds exist for assessing the performance of clustering methods, large-scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets, the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set, we are able to retain much of the nuance in real-world data that is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than those of the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.
Disjoint or Overlapping? Inference Windowing in Reconstruction-Based Time Series Anomaly Detection
Guillaume Coulaud, Reza Akbarinia, Florent Masseglia
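AI summary: Studies how the inference stride (overlapping windows) affects reconstruction-based time series anomaly detection and proposes a unified evaluation protocol; experiments show that overlapping windows yield average relative gains of up to 28% and can change method rankings.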
Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often hard to compare due to heterogeneous evaluation practices and underspecified inference procedures. In this paper, we revisit reconstruction-based anomaly detection in the univariate offline setting and study the role of the inference stride, which controls whether subsequences are processed as disjoint windows or with overlap. We propose a unified training, tuning, and multi-seed evaluation protocol on the curated TSB-AD benchmark, and study how overlapping inference affects anomaly detection performance for a range of reconstruction models, including PCA-based baselines, DLinear, an AutoEncoder, TimesNet, and Transformer variants. The results show that across all models, overlapping windows yield consistent improvements, with average relative gains of up to +28%, and can alter method rankings. We further analyze variability across datasets, random seeds, and hyperparameter configurations. Finally, we complement the benchmark study with an evaluation on the full UCR archive using localization criteria aligned with sliding-window reconstruction. Overall, our results highlight that reconstruction-based anomaly detection performance depends not only on model architecture and training, but also on inference choices, motivating a clear and reproducible protocol. Our results show that reconstruction-based baselines achieve strong performance on both the TSB-AD and UCR benchmarks, supporting them as competitive and practical approaches for univariate time series anomaly detection.
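To make the role of the inference stride concrete, the following sketch scores each time point by averaging squared reconstruction errors over all windows that cover it. Setting the stride equal to the window length gives disjoint windows; a smaller stride gives overlap. The averaging rule, the function names, and the window-mean "model" are assumptions made for illustration, not the protocol or models used in the paper.

```python
import numpy as np

def mean_reconstruct(segment):
    """Hypothetical stand-in for a trained model: reconstruct by the window mean."""
    return np.full_like(segment, segment.mean())

def pointwise_scores(series, reconstruct, window, stride):
    """Average squared reconstruction error per time point over covering windows."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    if not 1 <= stride <= window <= n:
        raise ValueError("need 1 <= stride <= window <= len(series)")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # extra window so the tail is covered
    err_sum = np.zeros(n)
    counts = np.zeros(n)
    for s in starts:
        seg = series[s:s + window]
        err_sum[s:s + window] += (seg - reconstruct(seg)) ** 2
        counts[s:s + window] += 1
    return err_sum / counts

x = np.zeros(100)
x[37] = 5.0  # a single injected spike
disjoint = pointwise_scores(x, mean_reconstruct, window=10, stride=10)
overlapping = pointwise_scores(x, mean_reconstruct, window=10, stride=1)
```

With disjoint windows each point is scored by exactly one window, so its score depends on where the window boundaries happen to fall; with overlap, each point is scored from several contexts.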
TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection
Yikai Zhang, Gaoxiang Jia, Jie Ding, Boxiang Wang
Affiliations: University of Iowa; University of Minnesota
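AI summary: Presents TorchKM, a GPU-accelerated kernel learning library that speeds up training and model selection for SVMs, kernel logistic regression, and related models through intelligent reuse of matrix operations, with substantial speedups over standard baselines.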
TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance with substantial speedups over standard baselines. The efficiency and programmable design also make TorchKM a kernel-learning component for AI-driven workflows. Code and documentation are available at https://github.com/YikaiZhang95/torchkm, and the package can be easily installed via PyPI.
mlr3mbo: Bayesian Optimization in R
Marc Becker, Lennart Schneider, Martin Binder, Lars Kotthoff, Bernd Bischl
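AI summary: Introduces mlr3mbo, a modular toolbox for Bayesian optimization in R that supports single- and multi-objective optimization, multi-point proposals, and parallelization; a coordinate descent search and benchmarks show performance on par with established optimizers.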
We present mlr3mbo, a modular toolbox for Bayesian optimization in R. mlr3mbo supports single- and multi-objective optimization, multi-point proposals, batch and asynchronous parallelization, and robust error handling. While it can be used for many standard Bayesian optimization variants in applied settings, researchers can also construct custom Bayesian optimization algorithms from its flexible building blocks. In addition to an introduction to the software, its design principles, and its building blocks, the paper presents two extensive empirical evaluations on the surrogate-based benchmark suite YAHPO Gym. To identify robust default configurations for both numeric and mixed-hierarchical optimization regimes, and to gain further insights into the respective impacts of individual settings, we run a coordinate descent search over the mlr3mbo configuration space and analyze its results. Furthermore, we benchmark mlr3mbo against a wide range of established optimizers, including HEBO, SMAC3, Ax, and Optuna, and find that it performs on par with state-of-the-art.
Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
Ronald Sielinski
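AI summary: To address the randomness of visibility measurement in AI generative search, proposes treating citation metrics as sample estimators, uses repeated sampling and bootstrap confidence intervals to expose measurement noise, and gives sample size recommendations.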
AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should be treated as sample estimators of an underlying response distribution rather than fixed values. We conduct an empirical study of citation variability across three generative search platforms--Perplexity Search, OpenAI SearchGPT, and Google Gemini--using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further demonstrates that citation rankings are unstable across samples, not only among top-ranked domains but throughout the frequently cited domain set. These findings demonstrate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates and provide practical guidance for sample sizes required to achieve interpretable confidence intervals.
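The idea of reporting a visibility metric with an uncertainty estimate can be sketched with a percentile bootstrap. The example assumes the data are a 0/1 indicator per sampled response (1 if a given domain was cited) and resamples responses with replacement; the function name, the input encoding, and the made-up counts are hypothetical, and the paper's exact estimators may differ.

```python
import numpy as np

def bootstrap_prevalence_ci(cited, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the share of responses citing a domain.

    cited: sequence of 0/1 flags, one per sampled response.
    Returns (point_estimate, lower, upper).
    """
    cited = np.asarray(cited, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap resample of response indices.
    idx = rng.integers(0, len(cited), size=(n_boot, len(cited)))
    boot_means = cited[idx].mean(axis=1)
    alpha = 1.0 - level
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(cited.mean()), float(lower), float(upper)

# Illustrative input: the domain was cited in 12 of 40 repeated runs.
flags = [1] * 12 + [0] * 28
estimate, lower, upper = bootstrap_prevalence_ci(flags)
```

Two domains whose intervals overlap heavily cannot be reliably ranked from such a sample, which is the practical meaning of a difference falling within the noise floor.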
Population-Adjusted Indirect Comparisons with the R Package outstandR
Nathan Green
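AI summary: For indirect treatment comparisons when head-to-head trials are absent, presents the R package outstandR, which performs population adjustment through G-computation and Multiple Imputation Marginalization, providing a unified framework for robust evidence synthesis.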
Indirect treatment comparisons (ITCs) are essential in Health Technology Assessment (HTA) when head-to-head clinical trials are absent. A common challenge arises when attempting to compare a treatment with available individual patient data (IPD) against a competitor with only reported aggregate-level data (ALD), particularly when trial populations differ in effect modifiers. While methods such as Matching-Adjusted Indirect Comparison (MAIC) exist to adjust for these cross-trial differences, they are increasingly being superseded by regression-based marginalization methods. Historically, software implementations for these methods have often been fragmented or limited in scope. This article introduces outstandR, an R package designed to provide a comprehensive and unified framework for population-adjusted indirect comparison (PAIC). outstandR implements advanced G-computation methods, within both maximum likelihood and Bayesian frameworks, as well as Multiple Imputation Marginalization (MIM), to address non-collapsibility. By streamlining the workflow of covariate simulation, model standardization, and contrast estimation, outstandR enables robust and compatible evidence synthesis in complex decision-making scenarios.
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso
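AI summary: Proposes the ChartAgent framework, which iteratively decomposes queries into visual subtasks and reasons in the spatial domain using chart-specific vision tools (such as drawing annotations and cropping regions), achieving state-of-the-art performance on ChartBench and ChartX, with especially large gains on unannotated charts.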
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, that is, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
Gradient-Guided Furthest Point Sampling for Robust Training Set Selection
Morris Trestman, Stefan Gugler, Felix A. Faber, O. A. von Lilienfeld
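AI summary: Proposes Gradient Guided Furthest Point Sampling (GGFPS), which uses molecular force norms to guide sampling of configurational space and, on the MD17 dataset, improves data efficiency and model robustness over FPS and random sampling.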
Training set sampling methods are used to improve model performance and lower data costs in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy system (the Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Our numerical results indicate superior data efficiency and model robustness when using GGFPS compared to FPS and uniform random sampling (URS), as well as established supervised FPS-style selectors, PCov-FPS and PCov-CUR. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to twofold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
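For readers unfamiliar with the baseline that GGFPS extends, here is a minimal sketch of plain Furthest Point Sampling under the Euclidean distance. It does not include the force-norm guidance described in the abstract, whose exact weighting is not specified here; the function name and the choice of starting index are assumptions.

```python
import numpy as np

def furthest_point_sampling(points, n_select, start=0):
    """Greedy FPS: repeatedly pick the point furthest from the selected set.

    points: array of shape (n_points, n_features).
    Returns the list of selected row indices, in selection order.
    """
    points = np.asarray(points, dtype=float)
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Four points on a line: the outlier at 10 is picked right after the start.
toy = np.array([[0.0], [1.0], [2.0], [10.0]])
order = furthest_point_sampling(toy, n_select=3)
```

Because FPS favors points far from everything already chosen, densely populated regions such as near-equilibrium geometries can end up under-represented, which is the artifact the abstract attributes to FPS.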
The Orthogonal Procrustes Problem Preserves Correlations in Synthetic Data
Oussama Ounissi, Nicklas Jävergård, Assaad Zeghina, Adrian Muntean
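AI summary: Proposes a lightweight postprocessing method based on the Orthogonal Procrustes problem that restores the Pearson correlation structure of synthetic tabular data while largely preserving feature distributions and downstream task performance.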
Synthetic data generation is increasingly used in applications involving privacy preservation, data sharing, and data scarcity. In many situations, preserving the dependence structure of the original data is of central interest. In this work, we propose a lightweight postprocessing methodology for synthetic tabular data based on the Orthogonal Procrustes problem. Starting from an already generated synthetic dataset, our approach constructs the closest dataset that restores the Pearson correlation structure of the original data. On the theoretical side, we show that preserving Pearson correlation is equivalent to the action of linear orthogonal maps in the centered-data subspace, and then deploy the Orthogonal Procrustes problem. However, in order for this to hold, we first establish a result ensuring that applying the Orthogonal Procrustes step remains in the aforementioned subspace under suitable assumptions. Applications to several datasets and synthetic data generators illustrate the effectiveness of the proposed approach. In particular, the numerical experiments indicate that the correlation structure can be restored while largely preserving the individual feature distributions, the geometry of the data, and the performance of downstream classification tasks.
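The classical Orthogonal Procrustes problem underlying the method asks for the orthogonal matrix Q that minimizes the Frobenius norm of A Q - B, and it has a closed-form solution via the singular value decomposition. The sketch below shows only this building block; how the paper sets up A and B from the original and synthetic data is not reproduced, and the function name is an assumption.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal Q minimizing ||A @ Q - B||_F.

    With the SVD A.T @ B = U @ diag(s) @ Vt, the minimizer is Q = U @ Vt.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
Q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
B = A @ Q_true
Q_hat = orthogonal_procrustes(A, B)
```

In this noiseless toy case the true orthogonal map is recovered; with real data the solution is the best orthogonal alignment in the least-squares sense.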
Flaws in the LLM Automation Narrative
George Perrett, Javae Elliott, Jennifer Hill, Marc Scott
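AI summary: Using a new benchmark that requires writing code to complete a data analysis task, finds that a frontier LLM falls short of human experts in average performance, variance, and error magnitude, challenging the claim that LLMs perform at human expert level.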
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.
Fixed-Threshold One-Bit Toeplitz Covariance Estimation under Sparse-Ruler Sampling
Zhiyong Cheng, Shengyao Chen
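AI summary: Studies Toeplitz covariance estimation when fixed-threshold one-bit quantization is combined with deterministic sparse-ruler sampling, proposes centered sparse-ruler Toeplitz estimators, proves a dimension-free Gaussian variance contraction theorem, and attains the minimax optimal rate under balanced coverage geometry.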
We study Toeplitz covariance estimation when fixed-threshold one-bit quantization is combined with deterministic sparse-ruler sampling. Each observed bit can enter many lag products. At a nonzero threshold the signs have nonzero mean, and this deterministic vertex reuse gives raw sign products a coherent one-vertex component. This component changes the variance geometry. Raw nonzero-threshold products are governed by weighted-degree row sums rather than by lag coverage or edge Frobenius geometry. Centering the signs removes the vertex component and leaves a degenerate sparse-pair statistic. We then prove a dimension-free Gaussian variance contraction theorem for hollow quadratic forms of bounded coordinate transforms. The theorem applies to hard threshold signs and controls arbitrary deterministic sparse supports by the Frobenius norm of the edge weights, with no dependence on dimension, support size or maximum degree. For operator-norm estimation, we construct centered sparse-ruler Toeplitz estimators with pooled marginal calibration. The leading oracle term is \[ \gamma_0 L_1 \kappa_{\rm obs} \sqrt{\frac{\varphi(\Omega)\log d}{n}}, \qquad \varphi(\Omega)=\sum_{s=1}^{d-1}q_s^{-1}, \] while the plug-in term is governed by the marginal bit budget \(n|\Omega|\). A real spectral-packing lower bound in a known-scale identity-neighborhood submodel shows that the \(\sqrt{\varphi(\Omega)\log d/n}\) dependence is intrinsic under balanced coverage geometry. In the non-saturated regime where this coverage term dominates, the oracle estimator is therefore minimax rate optimal over the submodel; the optimal dependence on the conditioning, curvature and plug-in calibration constants is left open.
A Structural Separation between Chernoff and Convex-Order Optimality in Robust Testing
Gökhan Gül
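AI summary: Shows that, in robust hypothesis testing, the equivalence whereby least favorable distributions simultaneously maximize all Chernoff u-affinities and minimize all f-divergences fails in general, proves the separation with a counterexample on a three-point probability space, and gives sufficient conditions for equivalence.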
In classical robust hypothesis testing, least favorable distributions often simultaneously maximize all Chernoff $u$-affinities and minimize all $f$-divergences. This paper identifies the structural mechanism that causes this equivalence to fail in general: the cone generated by fractional power functions $\{x^u\}_{u\in(0,1)}$ is strictly smaller than the cone of convex functions, inducing a separation between fractional-moment dominance and convex-order dominance. An explicit minimal counterexample is constructed on a three-point probability space, with convex, compact uncertainty classes and uniformly bounded likelihood ratios, for which a single pair maximizes all Chernoff functionals uniformly yet fails to minimize a convex $f$-divergence. It is further proved that no such separation can occur on a two-point space. Sufficient conditions for equivalence -- including stochastic ordering of likelihood ratios -- are discussed, and an open characterization problem in the geometry of moment cones is highlighted.
Minimum-Distortion Quantization with a Specified Output Distribution
Aolin Xu
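AI summary: Derives the optimal quantizer minimizing the mean squared error under a specified output distribution, of the form $X=\sigma\big(F_{\sigma^{-1}(X)}^{-1}(F_W(W))\big)$, and shows that it simplifies to $X=F_X^{-1}(F_W(W))$ in the uniform case; the main contribution is achieving minimum distortion by optimizing over permutations and cumulative distribution functions.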
We derive the optimal quantizer of a real-valued random variable $W$ with distribution $P_W$ such that 1) the distribution of the quantization output $X$ that can take $k$ values follows any specified distribution $P_X$ over $\{1,\ldots,k\}$, and 2) the minimum mean squared error (MMSE) of estimating $W$ from $X$ is minimized. It is shown that the optimal quantizer takes the form $X=\sigma\big(F_{\sigma^{-1}(X)}^{-1}(F_W(W))\big)$, where $\sigma$ is the optimal permutation of $\{1,\ldots,k\}$ among all permutations to minimize the MMSE, and $F$ is the cumulative distribution function. When $P_W$ is uniform over an interval or $P_X$ is uniform over $\{1,\ldots,k\}$, the quantizer takes a simple form $X=F_{X}^{-1}(F_W(W))$. The concept of majorization plays a key role in the optimality proof. Specifying the output distribution is useful for designing quantizers with explicitly controlled output entropy, maximized mutual information between input and output, tailored output distribution to match channel input requirements for communication, and data anonymization.
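The simple form $X=F_X^{-1}(F_W(W))$ can be illustrated directly: push $W$ through its own CDF to obtain a value in $[0,1]$, then apply the generalized inverse of the output CDF. The sketch assumes $W$ is uniform on $[0,1]$ (one of the cases in which the abstract states the simple form applies) and uses a hypothetical function name; it does not search for the optimal permutation $\sigma$ of the general case.

```python
import numpy as np

def quantize_with_output_law(w, cdf_w, p_x):
    """Map samples w to labels in {1, ..., k} so that the output law is p_x.

    cdf_w: callable CDF of W. p_x: probabilities of the labels 1..k.
    Implements X = F_X^{-1}(F_W(W)) with F_X^{-1}(u) = min{j : F_X(j) >= u}.
    """
    cum = np.cumsum(p_x)                 # F_X(1), ..., F_X(k)
    u = cdf_w(np.asarray(w, dtype=float))
    labels = np.searchsorted(cum, u, side="left") + 1
    # Guard against rounding in the last cumulative sum.
    return np.minimum(labels, len(p_x))

# W uniform on [0, 1], so its CDF is the identity on that interval.
w = (np.arange(1000) + 0.5) / 1000.0
x = quantize_with_output_law(w, lambda t: t, [0.2, 0.5, 0.3])
counts = np.bincount(x, minlength=4)[1:]  # label frequencies for 1, 2, 3
```

The resulting cells are contiguous intervals of $W$ whose probabilities match $P_X$ in label order, which is what distinguishes this construction from a quantizer that only minimizes distortion.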
Bidirectional Random Projections
Chao Lan, Luyuan Yang
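AI summary: Analyzes bidirectional random projections for ordinary least squares regression under fixed design and derives an expected excess loss bound for the OLS estimator built on projected data; compared with row-only projection, the gap is approximately $O(p_1 + C/p_1)$, where $C$ scales with $n_1/n$ and can be negative.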
This paper analyzes bidirectional random projections for ordinary least squares (OLS) regression under the fixed design setting. Let $(X,Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$ be a sample and $R \in \mathbb{R}^{n_1 \times n}, W \in \mathbb{R}^{p \times p_1}$ be two properly distributed random projections. We develop an expected excess loss bound for the OLS estimator built on $(WXR, WY)$. Compared to an established bound for OLS estimator built on $(XR, Y)$, the gap is approximately $O\left( p_1 + C \frac{1}{p_1} \right)$, where $C$ scales with $n_1/n$ and can be negative for small $n_1/n$. Its implications are confirmed by numerical results on real-world data.
Higher-Order Diffusion Sampling via Chebyshev Interpolation and Gauss-Seidel Iteration
Bingyuan Wei, Meng Huang
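AI summary: Proposes a Chebyshev-Gauss-Seidel higher-order sampler that, in the exact-score setting, needs only $d^{1+o_T(1)}\varepsilon^{-1/K_1}$ score function evaluations to reach total variation distance $\varepsilon$, relaxes the bounded-support assumption, and is robust to score and Jacobian estimation errors.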
Higher-order ODE solvers have shown strong empirical promise for accelerating diffusion models through the probability flow ODE, but rigorous non-asymptotic guarantees for such acceleration remain limited. In this paper, we develop a Chebyshev--Gauss--Seidel higher-order sampler and establish a non-asymptotic convergence guarantee that allows the approximation order to grow logarithmically with the number of outer iterations. In the exact-score setting, up to logarithmic factors, the proposed sampler requires at most \[ d^{1+o_T(1)}\varepsilon^{-1/K_1} \] score function evaluations to approximate the target distribution on \(\mathbb{R}^d\) within total variation distance \(\varepsilon\), where \(o_T(1)\to 0\) as \(T\to\infty\) and \(K_1>0\) is a sufficiently large constant. The analysis assumes only a polynomial second-moment bound on the target distribution, thereby relaxing the bounded-support condition imposed in existing higher-order theory. Moreover, the guarantee is robust to score and Jacobian estimation errors and does not require higher-order smoothness assumptions on the score estimates. Numerical experiments on anisotropic Gaussian mixture benchmarks support the predicted improvement in the accuracy--cost tradeoff under finite score-evaluation budgets.
Algorithmic and Minimax Complexity in Kernel Bandits
Yunbei Xu
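AI summary: Through a unified MAIR framework, places GP-UCB and the MAMS algorithm in a common language, proposes a safeguarded master algorithm combining the advantages of both, and shows that algorithmic complexity can be more informative than class-wide minimax or DEC certificates in overparameterized models.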
Gaussian-process upper confidence bound (GP-UCB) and decision-estimation-coefficient (DEC) methods may appear, at first sight, to belong to different theories. This paper places the two viewpoints in a common algorithmic-information language for frequentist RKHS bandits. GP-UCB fixes an algorithmic, rather than true, Gaussian-process prior and exploits realized-trajectory complexity together with computational tractability, whereas MAMS optimizes a robust class-wide MAIR/DEC envelope. Through the unified MAIR framework and heterogeneous positive-semidefinite algorithmic priors, we generalize both the GP-UCB analysis and the MAMS algorithm, propose a safeguarded master that combines their advantages, and provide a kernel-bandit construction showing that algorithmic complexity can be more informative than class-wide minimax or DEC certificates in overparameterized models. The resulting message is that algorithmic information and class-wide minimax coefficients answer different questions and can lead to different gaps; kernel bandits provide a clean setting in which this distinction becomes mathematically visible.
Sharp Convergence Rates for Empirical Estimation in MMD with Power Kernels
Francesco Colasanto, Matteo Focardi, Massimo Fornasier, Francesco Mattesini
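AI summary: Studies convergence rates for the empirical estimation of probability measures in the Maximum Mean Discrepancy (MMD) with power kernels, proving that for measures satisfying an Ahlfors regularity condition the best empirical approximation decays at the rate $N^{-\frac{1}{2}\left(1 + \frac{q}{\beta}\right)}$.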
We establish quantitative rates of convergence for the empirical estimation of probability measures by means of the Maximum Mean Discrepancy (MMD) with power kernel $K_q(x,y) = -|x-y|^q$, $q \in (0,2)$. The resulting discrepancy is the classical \emph{energy distance} $$\mathcal E_q^2(\mu, \omega) = -\frac{1}{2}\iint_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^q \, d(\mu - \omega)(x)\, d(\mu - \omega)(y),$$ and we ask how fast the best $N$-point empirical approximation $\inf_{\mu_N \in \mathcal{P}^N}\mathcal{E}_q(\mu_N,\omega)$ decays as $N \to \infty$. Given a probability measure $\omega$ on $\mathbb{R}^d$ with compact support satisfying an Ahlfors regularity condition of exponent $\beta \in (0,d]$, we prove that the sharp two-sided bound $$\mathcal E_q(\mu_N, \omega) \asymp N^{-\frac{1}{2}\left(1 + \frac{q}{\beta}\right)}$$ holds both for the worst-case empirical measure $\mu_N$ (lower bound, holding for every configuration of $N$ points) and for an optimally chosen empirical measure $\mu_N$ (upper bound). This complements the qualitative consistency result of Fornasier and Hütter \cite{fornasier2014consistency}, who proved narrow convergence of the minimizers of $\mathcal E_q^2(\cdot, \omega)$ over empirical measures without quantitative rates.
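For two empirical measures, the double integral defining the squared energy distance expands into three averages of pairwise distances: the cross term minus half of each within-sample term. The sketch below evaluates this expansion for finite samples with equal weights; the function names are hypothetical, and it only computes the discrepancy, not the optimal $N$-point configurations studied in the paper.

```python
import numpy as np

def _mean_pairwise_power(a, b, q):
    """Average of |a_i - b_j|^q over all pairs (i, j), including i == j."""
    diff = a[:, None, :] - b[None, :, :]
    return (np.linalg.norm(diff, axis=-1) ** q).mean()

def energy_distance_sq(x, y, q=1.0):
    """Squared energy distance between the empirical measures of x and y.

    x, y: arrays of shape (n_points, d). Expanding the double integral gives
    E|X - Y|^q - 0.5 * E|X - X'|^q - 0.5 * E|Y - Y'|^q.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cross = _mean_pairwise_power(x, y, q)
    within_x = _mean_pairwise_power(x, x, q)
    within_y = _mean_pairwise_power(y, y, q)
    return cross - 0.5 * within_x - 0.5 * within_y

rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 2))
same = energy_distance_sq(sample, sample, q=1.0)
shifted = energy_distance_sq(sample, sample + 3.0, q=1.0)
```

The value vanishes when both samples coincide and grows as the samples separate, consistent with its role as a discrepancy between measures.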
The Sample Complexity of Lossless Data Compression
Terence Viaud, Ioannis Kontoyiannis
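AI summary: Proposes a non-asymptotic framework for the fundamental limits of lossless compression, defines the sample complexity as the smallest blocklength required under given rate and excess-rate probability constraints, proves that the sample complexity of memoryless sources is determined by the Rényi entropy of order 1/2, and extends the results to Markov sources and universal compression.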
A new framework is introduced for examining and evaluating the fundamental limits of lossless data compression, that emphasizes genuinely non-asymptotic results. The {\em sample complexity} of compressing a given source is defined as the smallest blocklength at which it is possible to compress that source at a specifically constrained rate and to within a specified excess-rate probability. This formulation parallels corresponding developments in statistics and computer science, and it facilitates the use of existing results on the sample complexity of various hypothesis testing problems. For arbitrary sources, the sample complexity of general variable-length compressors is shown to be tightly coupled with the sample complexity of prefix-free codes and fixed-length codes. For memoryless sources, it is shown that the sample complexity is characterized not by the source entropy, but by its Rényi entropy of order~$1/2$. Nonasymptotic bounds on the sample complexity are obtained, with explicit constants. Generalizations to Markov sources are established, showing that the sample complexity is determined by the source's Rényi entropy rate of order~$1/2$. Finally, bounds on the sample complexity of universal data compression are developed for families of memoryless sources. There, the sample complexity is characterized by the minimum Rényi divergence of order~$1/2$ between elements of the family and the uniform distribution. The connection of this problem with identity testing and with the associated separation rates is explored and discussed.
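For a finite alphabet, the Rényi entropy of order $1/2$ mentioned above has the closed form $H_{1/2}(P) = 2\log\sum_i \sqrt{p_i}$. The sketch below computes it in bits next to the Shannon entropy for comparison; the function names are hypothetical, and the code does not reproduce the paper's bounds or constants.

```python
import math


def renyi_entropy_half(probs):
    """Rényi entropy of order 1/2 in bits: 2 * log2(sum_i sqrt(p_i))."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be a probability vector")
    return 2.0 * math.log2(sum(math.sqrt(p) for p in probs))


def shannon_entropy(probs):
    """Shannon entropy in bits, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    skewed = [0.5, 0.25, 0.25]  # illustrative source distribution (assumed)
    print(renyi_entropy_half(skewed), shannon_entropy(skewed))
```

The two quantities coincide for a uniform distribution, and the order-$1/2$ entropy is never smaller than the Shannon entropy.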
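Testing Axial Symmetry About an Unspecified Direction
Alejandro Cholaquidis, Juan Cuesta-Albertos, Ricardo Fraiman, Manuel Hernández-Banadik
AI summary: For the problem of testing axial symmetry of a multivariate distribution about an unknown direction, uses a simple-spectrum assumption on the covariance matrix to reduce the candidate directions to a finite set, constructs a Kolmogorov-Smirnov-type statistic from projected data and sample splitting, and establishes its asymptotic distribution and bootstrap validity.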
We consider the problem of testing whether a multivariate distribution is axially symmetric about some unknown direction. Under a simple-spectrum assumption on the covariance matrix, any symmetry axis must coincide with an eigenvector of the covariance matrix, so the problem reduces to testing a finite set of candidate directions. For each candidate direction, we construct a Kolmogorov--Smirnov-type statistic based on projected data and sample splitting. We derive its asymptotic distribution in a triangular-array framework and establish bootstrap validity under suitable regularity conditions. This leads to a feasible testing procedure for axial symmetry when the symmetry direction is unspecified.
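Upper Bounds on Local Learning Coefficients of Three-Layer Neural Networks
Yuki Kurumadani
AI summary: For singular parameter points of three-layer neural networks, proposes a counting rule based on budget, demand, and supply constraints to derive an upper bound on the local learning coefficient; the result covers activation functions such as swish and agrees with known exact values for one-dimensional input.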
Three-layer neural networks are known to form singular learning models, and their Bayesian asymptotic behavior is governed by the learning coefficient, or real log canonical threshold. Although this quantity has been clarified for regular models and for some special singular models, broadly applicable methods for evaluating it in neural networks remain limited. Recently, a formula for the local learning coefficient of semiregular models was proposed, yielding an upper bound on the learning coefficient. However, this formula applies only to nonsingular points in the set of realization parameters and cannot be used at singular points. In particular, for three-layer neural networks, the resulting upper bound has been shown to differ substantially from learning coefficient values already known in some cases. In this paper, we derive a formula for an upper bound on local learning coefficients at a class of singular realization parameters in three-layer neural networks. This formula can be interpreted as a counting rule under budget, demand, and supply constraints. In the non-polynomial real-analytic case, the formula applies in general settings, whereas in the polynomial case it applies under the restriction that the true distribution has no hidden units. In particular, our result covers activation functions such as the swish function and also includes polynomial activation functions under the above restriction, thereby extending previous results to a broader class of activation functions. We further show that, when the input dimension is one, the numerical value given by the right-hand side of our upper-bound formula agrees with the previously known learning coefficient, thereby providing a useful comparison with known exact results. Our result also provides a systematic perspective on how the weight parameters of three-layer neural networks affect the learning coefficient.
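Inference in Partially Identified Moment Models via Regularized Optimal Transport
Grigory Franguridi, Laura Liu
AI summary: Proposes an inference method for partially identified GMM models based on regularized optimal transport: it approximates the support function with entropic regularization computed efficiently by the Sinkhorn algorithm, establishes a CLT for entropic OT, obtains valid critical values via the bootstrap, and evaluates performance in Monte Carlo simulations and a panel logit model of happiness.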
Many statistical and econometric problems involve parameters defined by moments of a joint distribution when only marginal distributions are observed, leading naturally to partial identification. We develop a methodology for identification, estimation, and inference in the corresponding partially identified GMM model. We characterize the sharp identified set for the parameter of interest via a support-function/optimal-transport (OT) representation. To estimate the identified set, we employ entropic regularization, which yields a smooth approximation to the classical OT problem that can be computed efficiently using the Sinkhorn algorithm. We also propose a test statistic for hypothesis testing and the construction of confidence regions for the identified set. To derive its asymptotic distribution, we establish a novel central limit theorem for the entropic OT value under general smooth cost functions. We then obtain valid critical values using the bootstrap for directionally differentiable functionals of Fang and Santos (2019). The resulting testing procedure controls size locally uniformly, including at parameter values on the boundary of the identified set. We demonstrate good finite-sample performance of our methodology in Monte Carlo simulations. Finally, as an empirical illustration, we estimate a panel logit model of self-reported happiness with attrition and refreshment, using data from the Understanding America Study.
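The Sinkhorn algorithm referred to above alternates row and column rescalings of the Gibbs kernel $K = \exp(-C/\varepsilon)$. Below is a minimal sketch for discrete marginals, using one common convention for the regularization; the function names and the toy cost matrix are assumptions, and the paper's support-function estimator and test statistic are not reproduced.

```python
import math


def sinkhorn(a, b, cost, eps, n_iter=500):
    """Entropic OT between histograms a and b via Sinkhorn scaling.

    Returns the plan P = diag(u) K diag(v), with K = exp(-cost / eps).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Rescale rows so that the plan has row marginal a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the plan has column marginal b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Linear cost <P, C> of a transport plan."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))


if __name__ == "__main__":
    a = [0.5, 0.5]                   # illustrative marginals (assumed)
    b = [0.25, 0.75]
    cost = [[0.0, 1.0], [1.0, 0.0]]  # illustrative cost matrix (assumed)
    plan = sinkhorn(a, b, cost, eps=0.5)
    print(plan, transport_cost(plan, cost))
```

Smaller values of `eps` bring the plan closer to an unregularized optimal plan but slow convergence and can underflow the kernel; a log-domain implementation would be the usual remedy.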
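On Determinantal Point Processes with Nonsymmetric Kernels
Poinas Arnaud
AI summary: Uses the theory of $P_0$ matrices to give necessary and sufficient conditions for determinantal point processes with nonsymmetric kernels to be well defined, generalizes common results, and constructs attractive couplings of regular DPPs with symmetric kernels to model attraction between points of different marks.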
Determinantal point processes (DPPs for short) are a class of repulsive point processes. They have found some statistical applications to model spatial point pattern datasets with repulsion between close points. In the case of DPPs on finite sets, they are defined by a matrix called the DPP kernel which is usually assumed to be symmetric. While there are a few known examples of DPPs with nonsymmetric kernels, not much is known on how this affects their usual properties. In this paper, we demonstrate how to adapt the results on $P_0$ matrices to the DPP setting in order to get necessary and sufficient conditions for the well-definedness of DPPs with nonsymmetric kernels. We also generalize various common results on DPPs. We then show how to use these results to construct attractive couplings of regular DPPs with symmetric kernels in order to model spatial marked point patterns with repulsion between points of the same mark and attraction between points of different marks.
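A $P_0$ matrix is a square matrix whose principal minors are all nonnegative. As a rough illustration of why this class is relevant, the sketch below assumes an L-ensemble parametrization on a finite set, where the probability of a subset $S$ is proportional to $\det(L_S)$; this parametrization and all function names are assumptions made here, and the paper's exact conditions are not reproduced.

```python
from itertools import combinations


def det(m):
    """Determinant by cofactor expansion (fine for the tiny matrices used here)."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def principal_minors(m):
    """Map each index subset S (as a tuple) to det(m restricted to S)."""
    n = len(m)
    return {
        s: det([[m[i][j] for j in s] for i in s])
        for k in range(n + 1)
        for s in combinations(range(n), k)
    }


def is_p0(m, tol=1e-12):
    """True if every principal minor is nonnegative (up to tol)."""
    return all(v >= -tol for v in principal_minors(m).values())


def l_ensemble_probabilities(L):
    """Subset probabilities det(L_S) / sum_S det(L_S) for a P0 matrix L."""
    if not is_p0(L):
        raise ValueError("L has a negative principal minor")
    minors = principal_minors(L)
    z = sum(minors.values())  # equals det(I + L)
    return {s: v / z for s, v in minors.items()}


if __name__ == "__main__":
    L = [[1.0, 1.0], [-1.0, 1.0]]  # illustrative nonsymmetric kernel (assumed)
    print(l_ensemble_probabilities(L))
```

In this toy kernel the minor of the full set exceeds the product of the diagonal entries, which a symmetric positive semidefinite kernel cannot produce. The subset enumeration is exponential in the size of the ground set, so the check is only meant for very small examples.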
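Optimal Linear Filtering for Partially Observed Polynomial Processes in Discrete and Continuous Time
Jan Kallsen, Ivo Richert
AI summary: For partially observed polynomial processes, exploits their indistinguishability from Gaussian processes at the level of second-order moments to construct Gaussian equivalents and explicitly compute optimal linear filters, predictors, and smoothers.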
This paper is devoted to filtering, smoothing, and prediction of polynomial processes that are partially observed. These problems are known to allow for an explicit solution in the simpler case of linear Gaussian state space models. The key insight underlying the present piece of research is that in linear filtering applications polynomial processes and their discrete-time counterpart are indistinguishable from Gaussian processes sharing their first two moments. We describe the construction of these Gaussian equivalents of polynomial processes and explicitly compute optimal linear filters, predictors and smoothers for polynomial processes in discrete and continuous time. The consideration of Gaussian equivalents also opens the door to parameter estimation and linear-quadratic optimal control in the context of polynomial processes.
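The explicit solution for linear Gaussian state space models mentioned above is the Kalman filter, which depends only on first and second moments. The sketch below is a scalar discrete-time version for a hypothetical model $x_{t+1} = a x_t + w_t$, $y_t = x_t + v_t$; the model, parameter names, and data are assumptions, and the construction of Gaussian equivalents for polynomial processes is not reproduced.

```python
def kalman_filter(observations, a, q, r, m0, p0):
    """Scalar Kalman filter for x_{t+1} = a * x_t + w_t, y_t = x_t + v_t,
    with Var(w_t) = q and Var(v_t) = r. (m0, p0) are the mean and variance
    of the state one step before the first observation.

    Returns the list of filtered (mean, variance) pairs.
    """
    mean, var = m0, p0
    filtered = []
    for y in observations:
        # Prediction: propagate the first two moments through the dynamics.
        mean = a * mean
        var = a * a * var + q
        # Update: linear correction of the prediction by the new observation.
        gain = var / (var + r)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered


if __name__ == "__main__":
    ys = [1.2, 0.8, 1.1, 0.9]  # illustrative observations (assumed)
    for mean, var in kalman_filter(ys, a=1.0, q=0.01, r=0.5, m0=0.0, p0=10.0):
        print(round(mean, 4), round(var, 4))
```

Only means and variances enter the recursion, which is why a filter of this form remains the best linear filter for any process with the same second-order structure.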
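Variance Inequalities for Transformed Fréchet Means in Hadamard Spaces
Christof Schötz
AI summary: Studies variance inequalities for transformed Fréchet means in Hadamard spaces, covering the Fréchet median, the Fréchet mean, and the Huber loss-induced mean; characterizes the growth of the expected transformed distance away from a minimizer and gives a characterization of uniqueness of the Fréchet median.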
The Fréchet mean (or barycenter) generalizes the expectation of a random variable to metric spaces by minimizing the expected squared distance to the random variable. Similarly, the median can be generalized by its property of minimizing the expected absolute distance. We consider the class of transformed Fréchet means with nondecreasing, convex transformations that have a concave derivative. This class includes the Fréchet median, the Fréchet mean, the Huber loss-induced Fréchet mean, and other statistics related to robust statistics in metric spaces. We study variance inequalities for these transformed Fréchet means. These inequalities describe how the expected transformed distance grows when moving away from a minimizer, i.e., from a transformed Fréchet mean. Variance inequalities are useful in the theory of estimation and numerical approximation of transformed Fréchet means. Our focus is on variance inequalities in Hadamard spaces - metric spaces with globally nonpositive curvature. Notably, some results are new also for Euclidean spaces. Additionally, we are able to characterize uniqueness of transformed Fréchet means, in particular of the Fréchet median.
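On the real line, a transformed Fréchet mean of a sample minimizes the average of $\tau(|q - y_i|)$ over $q$, with $\tau(x) = x$ giving the median, $\tau(x) = x^2$ the mean, and the Huber function an intermediate case. The sketch below is restricted to this Euclidean, one-dimensional, empirical setting, which is an assumption made for illustration; the function names are hypothetical. It uses ternary search because the objective is convex when $\tau$ is nondecreasing and convex.

```python
def huber(x, delta=1.0):
    """Huber function: quadratic near zero, linear beyond delta."""
    return 0.5 * x * x if x <= delta else delta * (x - 0.5 * delta)


def objective(q, sample, transform):
    """Empirical expected transformed distance from q to the sample."""
    return sum(transform(abs(q - y)) for y in sample) / len(sample)


def transformed_frechet_mean(sample, transform, tol=1e-9):
    """Minimize the convex objective over [min(sample), max(sample)]
    by ternary search."""
    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, sample, transform) <= objective(m2, sample, transform):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    sample = [0.0, 1.0, 2.0, 3.0, 100.0]  # illustrative data with one outlier
    print(transformed_frechet_mean(sample, lambda x: x))      # median-type
    print(transformed_frechet_mean(sample, huber))            # Huber-type
    print(transformed_frechet_mean(sample, lambda x: x * x))  # mean-type
```

The outlier pulls the mean-type minimizer far from the bulk of the data, while the median-type and Huber-type minimizers stay near it.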
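Geometric Planted Matchings Beyond the Gaussian Model
Lucas da Rocha Schwengber, Roberto Imbuzeiro Oliveira
AI summary: Studies recovery of an unknown matching between a set of random points and its perturbation, derives minimax lower bounds using matchings in random geometric graphs, and shows that the estimator minimizing the sum of squared Euclidean distances is optimal in fixed dimension and makes no mistakes with high probability under high-dimensional conditions.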
We consider the problem of recovering an unknown matching between a set of $n$ randomly placed points in $\mathbb{R}^d$ and random perturbations of these points. This can be seen as a model for particle tracking and, more generally, entity resolution. We use matchings in random geometric graphs to derive minimax lower bounds for this problem that hold in great generality. Using these results we show that for a fixed $d$, as long as the noise distribution has finite $d$-th moment, and both initial positions and noise have bounded continuous densities, the minimax rate for the problem scales as $\Theta(n^2\sigma^d \wedge n)$. Under the stronger assumption that the tail of the noise is sub-Gaussian, we show that the order of the number of mistakes made by an estimator that minimizes the sum of squared Euclidean distances is minimax optimal when $d$ is fixed and is optimal up to $n^{o(1)}$ factors when $d = o(\log n)$. In the high-dimensional regime we consider a setup where both initial positions and perturbations have independent sub-Gaussian coordinates. In this setup we give sufficient conditions under which the same estimator makes no mistakes with high probability. We prove an analogous result for an adapted version of this estimator that incorporates information on the covariance matrix of the perturbations.
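The estimator that minimizes the sum of squared Euclidean distances can be illustrated on a toy instance. The sketch below searches all permutations by brute force, which is only feasible for very small $n$; the function names and data are hypothetical. For realistic sizes one would instead solve the same linear assignment problem with a dedicated solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def least_squares_matching(xs, ys):
    """Return the permutation pi minimizing sum_j ||ys[j] - xs[pi[j]]||^2.

    Brute force over all n! permutations, so only suitable for tiny n.
    """
    n = len(xs)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(squared_distance(ys[j], xs[perm[j]]) for j in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


if __name__ == "__main__":
    # Illustrative initial positions and shuffled, perturbed copies (assumed data).
    xs = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
    ys = [(0.1, 4.9), (-0.2, 0.1), (5.1, 5.2), (4.8, -0.1)]
    print(least_squares_matching(xs, ys))
```

In this example the perturbations are small relative to the spacing of the points, so the planted matching is recovered exactly; with larger noise some mistakes become unavoidable, which is the regime the minimax rates describe.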
Klaus Kähler Holst, Andreas Nordland, Julie Funch Furberg, Lars Holm Damgaard, Christian Bressen Pipper
Analysis of data from randomized controlled trials in vulnerable populations requires special attention when assessing treatment effect by a score measuring, e.g., disease stage or activity together with onset of prevalent terminal events. In reality, it is impossible to disentangle a disease score from the terminal event, since the score is not clinically meaningful after this event. In this work, we propose to assess treatment interventions simultaneously on the terminal event and the disease score in the absence of a terminal event. Our proposal is based on a natural data-generating mechanism, respecting that a disease score does not exist beyond the terminal event. We use modern semi-parametric statistical methods to provide robust and efficient estimation of the risk of terminal event and expected disease score conditional on no terminal event at a pre-specified landmark time. We also use the simultaneous asymptotic behaviour of our estimators to develop a powerful closed testing procedure for confirmatory assessment of treatment effect on both onset of terminal event and level of disease score in the absence of a terminal event. A simulation study mimicking a large-scale outcome trial in chronic kidney patients as well as an analysis of that trial is provided to assess performance.
Sarah C. Lotspeich, Abbey Collins, Brian J. Wells, Ashish K. Khanna, Joseph Rigdon, Lucy D'Agostino McGowan
Objective: Electronic health records (EHR) data are prone to missingness and errors. Previously, we devised an "enriched" chart review protocol where a "roadmap" of auxiliary diagnoses (anchors) was used to recover missing values in EHR data (e.g., a diagnosis of impaired glycemic control might imply that a missing hemoglobin A1c value would be considered unhealthy). Still, chart reviews are expensive and time-intensive, which limits the number of patients whose data can be reviewed. Now, we investigate the accuracy and scalability of a roadmap-driven algorithm, based on ICD-10 codes (International Classification of Diseases, 10th revision), to mimic expert chart reviews and recover missing values. Materials and Methods: In addition to the clinicians' original roadmap from our previous work, we consider new versions that were iteratively refined using large language models (LLM) in conjunction with clinical expertise to expand the list of auxiliary diagnoses. Using chart reviews for 100 patients from the EHR at an extensive learning health system, we examine algorithm performance with different roadmaps. Using the larger study of $1000$ patients, we apply the final algorithm, which uses a roadmap with clinician-approved additions from the LLM. Results: The algorithm recovered as much, if not more, missing data as the expert chart reviewers, depending on the roadmap. Discussion: Clinically driven algorithms (enhanced by LLM) can recover missing EHR data with similar accuracy to chart reviews and can feasibly be applied to large samples. Extending them to monitor other dimensions of data quality (e.g., plausibility) is a promising future direction.