arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10681 2026-06-10 econ.TH New submission

Limited belief propagation and contingent thinking

Andrew Ellis, Ran Spiegler

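AI summary: The paper characterizes non-Bayesian belief updating after an observation through a limited number of inference steps on a directed acyclic graph, accounts for correlation neglect and violations of iterated expectations, and applies the framework to public-good provision and social learning games.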

Abstract

An agent updates her beliefs over a set of variables after observing some of them. We provide a representation of updated beliefs that captures limited propagation of her observation's implications through the directed acyclic graph that represents the relations between all variables. Failure of contingent thinking occurs when she performs fewer inference steps from unobserved variables than observed ones, leading to correlation neglect and violations of iterated expectations. Our framework offers a new perspective on existing experiments about contingent thinking and suggests new directions. We characterize the model's relationship with familiar Bayesian and non-Bayesian benchmarks, and illustrate it with applications to public-good provision and social learning games.

2606.10438 2026-06-10 econ.TH New submission

Sequential Search with Planning

Ruhi Sonal, Saptarshi Mukherjee, Abhinaba Lahiri, Aniruddha Ghosh

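AI summary: The paper studies sequential search in new-product development or resource exploration with an ordered Pandora's box model, introduces planning costs, proves the existence of reservation values that depend on the paid scope, and analyzes how the guarantee, paid-scope, and remaining-horizon effects shape the optimal strategy.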

Abstract

Sequential development of a new product or technology, or natural resource exploration, often progresses through ordered stages with uncertain rewards and requires costly (ex ante) planning to make future stages accessible. We model this process as an ordered Pandora's box problem where a decision-maker first chooses an initial scope, paying a cost that rises with the number of stages made accessible, and may later expand the scope at a marginal adjustment cost. Since the paid planning costs are sunk, the continuation values depend on the state variable "paid scope". We prove existence and uniqueness of scope-dependent reservation values, characterize the optimal search strategy as a threshold rule indexed by paid scope, and derive comparative statics. Interactions among three economic forces shape the optimal behavior: a guarantee effect (a higher current best offer reduces the expected improvement from the next stage and induces earlier stopping), a paid-scope effect (a larger prepaid scope lowers the marginal cost of future access, raises the continuation value, and supports continuation at higher guarantees), and a remaining-horizon effect (fewer stages remaining shrink the option value of continuing). Two examples illustrate how these forces generate distinct planning and search patterns under normal and fat-tailed rewards.
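
As background for the reservation values mentioned in this abstract, the sketch below computes the reservation value of a single box in the classical (Weitzman-style) Pandora's box problem, where the value z solves cost = E[max(X - z, 0)]. It is only an illustration of the textbook single-box case with a discrete reward distribution; it is not the scope-dependent reservation value derived in the paper, and the function names and the bisection solver are assumptions made for this example.

```python
def expected_gain(z, outcomes, probs):
    """E[max(X - z, 0)] for a discrete reward X (hypothetical helper)."""
    return sum(p * max(x - z, 0.0) for x, p in zip(outcomes, probs))


def reservation_value(cost, outcomes, probs, tol=1e-9):
    """Solve cost = E[max(X - z, 0)] for z by bisection.

    The expected gain is continuous and non-increasing in z, is at least
    `cost` at z = min(outcomes) - cost, and equals 0 at z = max(outcomes),
    so a root lies in that bracket whenever cost > 0.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    lo, hi = min(outcomes) - cost, max(outcomes)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_gain(mid, outcomes, probs) > cost:
            lo = mid  # gain still exceeds the cost: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Reward is 0 or 10 with equal probability; opening the box costs 1.
    print(reservation_value(1.0, [0.0, 10.0], [0.5, 0.5]))
```

In the classical problem, search stops once the best offer in hand exceeds the reservation value of every unopened box, which is the same threshold logic the abstract indexes by paid scope.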

2606.10127 2026-06-10 econ.TH New submission

Data-Driven Automation

Maryam Farboodi, Andrew Koh, Anchi Xia

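AI summary: The paper builds a dynamic model of data-driven automation and studies how data heterogeneity, endogenous accumulation, and spillovers affect the automation process; it finds that long-run automation follows a power-law decay and that the economy is generically inefficient.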

Abstract

We build a dynamic model of data-driven automation in which data (i) is heterogeneous and task-specific; (ii) accumulates endogenously as a byproduct of economic activity; and (iii) exhibits spillovers such that data generated by one task can augment the productivity of another. Along the transition path of automation, data plays a dual role in simultaneously augmenting the productivity of already-automated tasks and expanding the automation frontier. We derive tight conditions for the economy to be partially versus fully automated in the long run. In the latter case, automation exhibits rich short-run dynamics that depend on the pattern of data spillovers but is always slow in the long run: the share of tasks produced by labor decays asymptotically as a power law in time. We show that the economy is generically inefficient and analyze how a planner optimally tilts the direction of data accumulation. With endogenous capital accumulation, data-driven automation generates explosive growth but stagnant long-run wages.

2606.11135 2026-06-10 eess.SP New submission

Pre-Fault Voltage Discrimination and Time-Domain Protection for Distribution Networks with Inverter-Based Resources

Junyuan Zhao, François Bouffard, Géza Joós

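AI summary: To address the failure of conventional overcurrent protection caused by inverter-based resources, the paper proposes a pre-fault voltage discrimination strategy combined with time-domain protection principles for fast and reliable fault detection.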

Abstract

The increasing proliferation of inverter-based resources (IBRs) in distribution networks is presenting a major challenge for phasor-based overcurrent protection. This challenge stems from IBRs' lack of short-circuit current sourcing capacity. As a result, traditional overcurrent protection functions (e.g., ANSI 51) are inadequate in such scenarios, and alternative approaches are warranted. Time-domain protection, for example, shows promise in overcoming this challenge. In this paper we propose a pre-fault voltage discrimination (PVD) strategy whose role is to detect faults and discriminate normal switching and transformer inrush disturbances from actual faults. The use of PVD allows for the design of a simple yet effective fault detection algorithm by using time-domain protection principles for distribution networks containing IBRs. The introduction of PVD enables faster fault detection without reducing security and dependability. Offline simulation experiments and controller hardware-in-the-loop real-time simulation validate the effectiveness of the proposed algorithm against various fault and normal switching events.

2606.10900 2026-06-10 eess.SP New submission

Personalized Deep Learning for Short-Term Forecasting of Impending Atrial Fibrillation from Continuous Wearable ECG Signals

Jangwon Suh, Soonil Kwon, Jungmin Ko, Yun Kwan Kim, Hee Seok Song, Eue-Keun Choi, Wonjong Rhee

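AI summary: To address inter-patient variability in atrial fibrillation forecasting from wearable ECG, the paper personalizes a global model through fine-tuning, significantly improves performance on three datasets, and identifies clinically relevant precursors such as heart rate and RMSSD.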

Comments Code is available at https://github.com/SNU-DRL/Personalized-AF-Forecasting

Abstract

Background and Objective: Continuous wearable electrocardiogram (ECG) monitoring is increasingly used for ambulatory arrhythmia surveillance, yet forecasting impending atrial fibrillation (AF) is challenged by inter-patient ECG variability. This study investigated whether personalizing a global model via fine-tuning on an individual's ECG signals improves short-term forecasting of impending AF. Methods: A global model trained on the ICENTIA11K dataset was compared against personalized models fine-tuned across three cohorts: ICENTIA11K, IRIDIA-AF, and MobiCARE. Following preprocessing, models processed 60-second ECG segments for a five-minute forecast horizon. We evaluated the impact of adaptation data volume and analyzed ECG features, such as heart rate and RMSSD. Results: Personalized models significantly outperformed the global model, achieving AUROCs of 0.711 vs. 0.614 in ICENTIA11K and 0.686 vs. 0.585 in MobiCARE. Personalization benefits increased with the amount of patient-specific fine-tuning data. While the global model's accuracy rose as AF onset approached, personalized models in the two external cohorts exhibited distinct temporal dynamics, which may indicate the capture of patient-specific characteristics less dependent on proximity to the AF event. Pre-AF episodes showed elevated heart rates and RMSSD. Feature attributions highlighted clinically relevant precursors, including frequent premature atrial complexes (PACs) and short supraventricular tachycardias (SVTs). Conclusions: Adapting deep learning models with patient-specific wearable ECG data significantly enhances short-term forecasting of impending AF. This personalized framework supports timely preventive interventions and improved AF management in ambulatory monitoring environments.
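
The two ECG features named in this abstract, heart rate and RMSSD, have standard definitions that can be computed from a series of RR intervals. The sketch below is a minimal illustration under the assumption that RR intervals are already extracted and given in milliseconds; the function names are hypothetical and the paper's own preprocessing is not described here.

```python
import math


def rmssd(rr_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    if not diffs:
        raise ValueError("need at least two RR intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate_bpm(rr_ms):
    """Mean heart rate in beats per minute from RR intervals in ms."""
    if not rr_ms:
        raise ValueError("need at least one RR interval")
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


if __name__ == "__main__":
    rr = [800, 810, 790, 800]  # illustrative values, not study data
    print(rmssd(rr), mean_heart_rate_bpm(rr))
```

A higher RMSSD reflects larger beat-to-beat variation, which is why it is a natural companion to heart rate when describing the period before an AF episode.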

2606.10869 2026-06-10 eess.SP New submission

Information Bottleneck Meets Quantization: Finite Rate Analysis and Optimal Designs

Francesco Binucci, Paolo Banelli

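AI summary: The paper theoretically analyzes how scalar and vector quantization of the Gaussian Information Bottleneck (GIB) latent representation affects informativeness about the target data, proposes task-oriented quantization designs under a finite-rate constraint, validates them on MMSE regression problems, and extends the task-oriented approach to non-Gaussian settings.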

Comments 16 pages, 9 figures

Abstract

The Information Bottleneck (IB) is a well-established framework that seeks a compact latent representation of a data source by trading the rate and size of the representation against the accuracy of the information it retains about separate target data. The Gaussian IB (GIB) is its simple closed-form solution when the target is jointly Gaussian with the source. In many practical problems, however, the latent representation has to be stored or represented with a finite number of bits, a constraint that the optimal (G)IB solution does not satisfy. First, this manuscript theoretically analyzes the effect of scalar and vector quantization of the GIB latent representation, and its impact on the (dis)informativeness with respect to the target data. Then, task-oriented quantization designs are proposed by (jointly) reformulating the GIB optimization problem under a finite-rate constraint on the latent representation. Simulation results on MMSE regression problems confirm the effectiveness of the proposed quantization designs, which show significant gains over more heuristic, or separate, quantization designs of the standard GIB latent representation. Finally, the paper extends the task-oriented philosophy to non-Gaussian settings by suitably modifying the cost function used in variational auto-encoders (VAEs) of IB-inspired vector quantizers.

2606.10864 2026-06-10 eess.AS New submission

Phoneme-First Prediction for LLM-Based Speech Recognition

Jakob Poncelet, Hugo Van hamme

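AI summary: The paper integrates a phoneme prediction step into the LLM, which predicts phonemes before generating the transcript, to improve speech recognition accuracy and explainability in low-resource settings.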

Comments Accepted at EUSIPCO 2026

Abstract

Recent research has explored integrating Large Language Models (LLMs) with speech encoders to create speech-augmented LLMs capable of contextualized speech recognition. The main challenge lies in aligning the semantic embeddings of LLMs with the acoustic representations of speech encoders. We propose a novel approach that teaches the LLM to first predict phonemes from the speech features before generating the final transcript. By integrating a phoneme prediction step directly into the LLM, the model develops a fine-grained knowledge of pronunciation, reducing acoustic confusion and improving transcription accuracy and explainability. Our method is cheap and simple, as phoneme targets can be automatically derived from existing transcripts. Through comprehensive experiments, we show that intermediate phoneme prediction can improve speech recognition, particularly in low-resource settings, and yields outputs that are acoustically more faithful to the speech.

2606.10853 2026-06-10 eess.AS New submission

Speech Encoder Fusion for LLM-based Automatic Speech Recognition

Jakob Poncelet, Hugo Van hamme

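AI summary: The paper studies fusing multiple pre-trained speech encoders to improve LLM-based ASR, proposes several fusion strategies, and validates them across multiple scenarios.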

Comments Accepted at Interspeech 2026

Abstract

Speech-aware large language models (LLMs) can incorporate speech through pre-trained acoustic encoders that project speech features into the LLM embedding space. While the choice of the speech encoder critically influences performance, different encoders often exhibit complementary strengths, motivating their combination. In this work, we investigate whether fusing multiple pre-trained speech encoders can enhance speech-aware LLMs for automatic speech recognition (ASR). We explore several fusion strategies beyond simple feature concatenation, including learned combinations and Transformer-based fusion architectures, and evaluate them across mono- and multilingual ASR settings as well as diarized speech recognition. Our results indicate that carefully fusing multiple parallel speech encoders improves downstream performance in all scenarios with limited computational overhead.

2606.10838 2026-06-10 eess.AS New submission

Towards Deep Contextual Reasoning from Broad Descriptions for ASR with Speech-LLM via Metadata-Driven Reasoning Chains

Jakob Poncelet, Hugo Van hamme

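AI summary: The paper proposes a training method that lets a speech-LLM use broad descriptions as weak semantic priors and correct transcripts in context through chain-of-thought reasoning, reducing errors on rare words and named entities.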

Comments Accepted at Interspeech 2026

Abstract

Speech recognition often fails on rare, domain-specific terms and context-related named entities. Existing contextualization techniques typically bias decoding with keywords or phrase lists, which does not scale well or exploit deeper knowledge. We propose a training method that teaches a speech-LLM to use broad descriptions (e.g. from videos) as weak semantic priors to perform contextual reasoning grounded in the audio. We build 400 hours of reasoning-augmented speech data by pairing erroneous hypotheses with video metadata and LLM-generated reasoning explanations that justify context-driven corrections. We finetune the speech-LLM to perform chain-of-thought reasoning: generate an initial transcript, then reason over the context, and finally return a corrected transcript. On held-out YouTube-derived test sets, our approach reduces errors, with specific improvements on rare words and named entities, and lays groundwork for deeper contextual reasoning in speech recognition.

2606.10758 2026-06-10 eess.AS New submission

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

Cristian-Teodor Neamtu, Serban Mihalache, Stefan Smeu, Dan Oneata, Horia Cucu, Dragos Burileanu

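AI summary: The paper proposes a metric learning framework based on the Proxy-Anchor loss that uses Wav2Vec2-BERT embeddings for TTS source attribution and detection of unknown systems, reaching 99.76% accuracy and a 2.04% false positive rate on a dataset spanning 140 TTS systems.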

Comments Accepted to the 34th European Signal Processing Conference (EUSIPCO 2026)

Abstract

The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. For a fair comparison against the current state of the art, we further evaluate it on the official MLAAD v5 dataset splits, where it nearly doubles the OOD accuracy. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.
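
The FPR@95 figure quoted in this abstract is commonly read as the false positive rate at the threshold where 95% of in-distribution samples are accepted. The sketch below follows that common convention and assumes that higher scores mean "more in-distribution"; the paper's exact scoring rule and convention are not given here, so the function name and the tie handling are assumptions.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted at the threshold that keeps 95% of ID samples.

    Assumes higher score = more in-distribution, and that a sample is
    accepted as in-distribution when its score is >= the threshold.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples covering at least 95% (integer ceiling).
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


if __name__ == "__main__":
    id_scores = list(range(1, 21))   # illustrative scores, not paper data
    ood_scores = [0, 1, 2, 3]
    print(fpr_at_95_tpr(id_scores, ood_scores))
```

A low value means few unknown systems are mistaken for known ones while the detector still accepts 95% of genuinely known samples.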

2606.10540 2026-06-10 eess.SP New submission

Complex VAE with Heavy-Tailed Likelihood for Radar Target Detection in Sea Clutter

Ting Bai, Jun Tang, Yuxin Xu

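AI summary: To handle the heavy-tailed, spiky nature of sea clutter and the scarcity of target labels, the paper proposes an unsupervised complex-valued variational autoencoder that uses a Student-t negative log-likelihood to capture heavy-tailed reconstruction errors and adds a time-domain amplitude error constraint, enabling radar target detection at a constant false-alarm rate.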

Abstract

To address the heavy-tailed, spike-prone nature of sea clutter and the scarcity of labeled target data, an unsupervised complex-valued variational autoencoder (VAE) for maritime radar target detection is proposed. In implementation, each complex baseband slow-time sequence is represented by its in-phase and quadrature components, and the model learns their joint reconstruction from clutter-only data. A Student-\(t\) negative log-likelihood is adopted to capture heavy-tailed reconstruction errors while reducing sensitivity to outliers during clutter learning. In addition, a time-domain amplitude error constraint is introduced to penalize slow-time magnitude mismatch in the reconstruction. At inference, reconstruction deviation is used as the detection statistic, and the decision threshold is set via an empirical quantile estimated from a clutter-only validation set to enforce a constant false-alarm rate (CFAR). Experiments on measured sea-clutter data show that detection performance is consistently improved over MF, AMF, and a real-valued \(\beta\)-VAE under CFAR constraints.
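
Two ingredients of this abstract lend themselves to a short sketch: a Student-t negative log-likelihood, which penalizes large residuals far less than a Gaussian one, and a detection threshold taken as an empirical quantile of clutter-only scores. The code below is an illustration under stated assumptions: a scalar zero-mean residual, fixed degrees of freedom `nu` and `scale`, and a simple order-statistic quantile. None of these choices, nor the function names, are taken from the paper.

```python
import math


def student_t_nll(residual, nu=3.0, scale=1.0):
    """Negative log-density of a zero-mean Student-t with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    z2 = (residual / scale) ** 2
    return 0.5 * (nu + 1) * math.log1p(z2 / nu) - log_norm


def gaussian_nll(residual, scale=1.0):
    """Negative log-density of a zero-mean Gaussian, for comparison."""
    return 0.5 * math.log(2 * math.pi * scale ** 2) + 0.5 * (residual / scale) ** 2


def cfar_threshold(clutter_scores, pfa):
    """Order-statistic threshold: at most floor(pfa * n) clutter scores exceed it."""
    if not 0.0 < pfa < 1.0:
        raise ValueError("pfa must lie strictly between 0 and 1")
    ranked = sorted(clutter_scores)
    allowed = math.floor(pfa * len(ranked))
    return ranked[len(ranked) - allowed - 1]


def detect(score, threshold):
    """Declare a target when the reconstruction deviation exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    scores = [5, 1, 8, 3, 2, 7, 4, 6]  # illustrative clutter-only scores
    thr = cfar_threshold(scores, pfa=0.25)
    print(thr, sum(detect(s, thr) for s in scores) / len(scores))
    print(student_t_nll(10.0), gaussian_nll(10.0))
```

Because the threshold depends only on clutter-only scores, the false-alarm rate on data drawn from the same clutter distribution stays near the chosen `pfa` regardless of how the score itself is computed.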

2606.10464 2026-06-10 eess.AS New submission

GC-LoRA: Gated Convolutional LoRA for Parameter-Efficient Acoustic Adaptation

Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang, Mohan Shi, Abeer Alwan

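AI summary: The paper proposes GC-LoRA, an adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders to capture local acoustic dependencies efficiently, achieving word error rate reductions of up to 10.9% across several acoustically mismatched domains.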

Comments Accepted for publication at Interspeech 2026

Abstract

Transformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), adjust global attention, they lack the local context modeling crucial for capturing domain-specific variations. We propose GC-LoRA, a novel adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders. By integrating a lightweight adapter into the encoder attention output projections, our method efficiently captures local acoustic dependencies without disrupting pretrained global representations. Experiments across diverse datasets (acoustically degraded, bandlimited, dialectal, child) demonstrate the efficacy of our approach, achieving Word Error Rate (WER) reductions of up to 10.9% compared to baselines while adding minimal trainable parameters.

2606.10240 2026-06-10 eess.IV New submission

Laplace-Mixture Dipole Inversion for Quantitative Susceptibility Mapping

Shuai Huang, James J. Lah, Jason W. Allen, Deqiang Qiu

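AI summary: The paper proposes LAMDI, an automatic dipole inversion method based on a Laplace mixture prior that preserves fine anatomical structures in quantitative susceptibility mapping without manual parameter tuning, with performance comparable to existing methods.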

Abstract

Purpose: To develop an automatic dipole inversion method for quantitative susceptibility mapping (QSM) that preserves fine anatomical structures without the need for manual regularization-parameter tuning. Theory: The original approximate message passing with parameter estimation (AMP-PE) framework models image gradients with a single Laplace prior, which does not fully capture the heavy-tailed gradient distribution of brain susceptibility maps. This prior mismatch can lead to over-regularization and blocky reconstructions. We address this limitation by modeling the gradients with a two-component Laplace mixture prior. Methods: We propose a Laplace-Mixture Dipole Inversion (LAMDI) method by incorporating a two-component Laplace mixture prior into the AMP-PE framework with automatic parameter estimation. LAMDI was evaluated on a public in vivo dataset. Its performance was compared with FANSI, MEDI, and AMP-PE with a single-Laplace prior (AMP-PE-L1) under both standard default and reference-tuned settings. Results: On a public multi-orientation QSM dataset, LAMDI achieved NRMSE and SSIM comparable to AMP-PE-L1 while substantially reducing HFEN, suggesting improved preservation of high-frequency anatomical detail. Under reference-based tuning, FANSI and MEDI achieved the best performance for some metrics, but LAMDI remained competitive without requiring reference maps or manual regularization tuning. Conclusion: LAMDI provides an effective and automatic parameter-estimation alternative for QSM dipole inversion by combining competitive reconstruction accuracy with improved preservation of fine anatomical detail.
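
For readers unfamiliar with the prior named in this abstract, a two-component Laplace mixture on a gradient value g can be written in the generic form below. The symbols w, b_1 and b_2 are notation chosen for this sketch; the paper's own parametrization and how AMP-PE estimates these parameters are not given here.

```latex
p(g) \;=\; w\,\frac{1}{2 b_1}\exp\!\left(-\frac{|g|}{b_1}\right)
      \;+\; (1-w)\,\frac{1}{2 b_2}\exp\!\left(-\frac{|g|}{b_2}\right),
\qquad 0 \le w \le 1,\quad b_1, b_2 > 0 .
```

Setting w = 1 recovers the single Laplace prior. With b_1 much smaller than b_2, one component concentrates mass near zero (smooth regions) while the other leaves room for large gradients (edges), which is one way a mixture can fit a heavier-tailed gradient distribution than a single Laplace density.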

2606.10190 2026-06-10 eess.SP New submission

Optimal Illumination via Joint Movement and Phase Optimization for Movable Antenna-RIS Configuration

Yan Zhang, Nicola Marchetti, Indrakshi Dey

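AI summary: The paper proposes a movable antenna-enhanced RIS architecture, models antenna movement with stochastic differential equations, and optimizes long-term SNR with a two-timescale framework, reaching up to 36 dB steady-state SNR and up to 16 times higher energy efficiency.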

Abstract

Reconfigurable intelligent surfaces (RIS) enable programmable control of wireless propagation but remain vulnerable to persistent deep fades in static deployments. This paper introduces a Movable Antenna-enhanced RIS (MA-RIS) architecture where antenna elements physically reposition to sample independent spatial channels, enabling mobility-induced diversity. We model antenna motion using a Stochastic Differential Equation (SDE) framework capturing controlled drift and environmental diffusion. Itô calculus-based analysis characterizes steady-state antenna distributions, spatial decorrelation, and outage probability, revealing fundamental trade-offs between control strength and mobility randomness. To maximize long-term SNR while accounting for control overhead, we propose an overhead-aware two-timescale framework separating slow antenna trajectory control from fast phase adaptation. The stochastic optimal control problem is solved via predictive approximation of the Hamilton-Jacobi-Bellman (HJB) formulation, enabling real-time implementation. Simulations validate theoretical predictions: the two-timescale strategy achieves up to 36 dB steady-state SNR with remarkable stability, outperforming position-only control by up to 15 dB and uncontrolled baselines by over 30 dB. Despite experiencing a lower SNR than active RIS, the proposed approach delivers up to 16 times higher energy efficiency (EE) across varying system scales, establishing a new paradigm of mobility-enabled channel adaptation for resilient wireless systems.
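
To make "controlled drift and environmental diffusion" concrete, the sketch below simulates a one-dimensional Ornstein-Uhlenbeck process, dX = -kappa (X - target) dt + sigma dW, with the Euler-Maruyama scheme. This particular SDE is an assumption chosen for illustration and is not taken from the paper: `kappa` stands in for control strength and `sigma` for mobility randomness, and all names are hypothetical.

```python
import math
import random


def simulate_position(x0, target, kappa, sigma, dt, steps, seed=0):
    """Euler-Maruyama path of dX = -kappa * (X - target) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x += -kappa * (x - target) * dt + sigma * dw
        path.append(x)
    return path


def stationary_variance(kappa, sigma):
    """Stationary variance of the Ornstein-Uhlenbeck process: sigma^2 / (2 kappa)."""
    return sigma ** 2 / (2 * kappa)


if __name__ == "__main__":
    path = simulate_position(x0=1.0, target=0.0, kappa=2.0, sigma=0.5,
                             dt=0.01, steps=1000)
    print(path[-1], stationary_variance(2.0, 0.5))
```

In this toy model the stationary spread sigma^2 / (2 kappa) shrinks as control strength grows and widens with diffusion, a simple instance of the kind of trade-off between control strength and randomness that the abstract refers to.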

2606.10164 2026-06-10 eess.SP New submission

Curved Beam Enabled Wireless Communications: Modeling, Analysis and Optimization

Jiawei Yao, Xiaoren Xu, Walid Saad, Mingzhe Chen

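AI summary: For scenarios with a blockage, the paper uses a continuous aperture array to generate curved beams that improve wireless communication performance, models beam control and a segmented channel, and designs an iterative algorithm based on fractional programming and an enhanced block coordinate ascent method to optimize the weighted sum-rate.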

Abstract

In this paper, the problem of using curved beams to improve wireless communication performance in the presence of a blockage is studied. In particular, a transmitter equipped with a continuous aperture array can generate curved beams to serve multiple receivers by allowing signals to propagate along both straight and curved paths. To optimize the weighted sum-rate, a curved beam model is developed for controlling the beam steering, beam focusing, and beam curving functions, along with a segmented channel model to characterize practical channels induced by the blockage. Based on the introduced curved beam model, an optimization problem is posed with the goal of maximizing the weighted sum-rate of all users under a transmit power budget and physical constraints of curved beams. To solve this problem, the continuous aperture is first converted into finite summations via a discrete sampling of the continuous coordinate. Then, the performance gap between the ideal continuous aperture design and its practical discrete aperture approximation is analyzed. Based on the above discrete approximation, an iterative algorithm is developed to optimize curved beam control parameters. In particular, the original problem is reformulated into a tractable form via fractional programming (FP). The transformed problem is then solved with an enhanced block coordinate ascent (BCA) method that determines a surrogate-construction point by leveraging the local descent from previous iterations, thereby accelerating convergence. Next, a proximal regularization term is included in the surrogate function to control the update magnitude and suppress aggressive updates, thereby improving update stability. Finally, the beam amplitudes are computed based on the effective channel gains. Simulation results show that the proposed method can improve the weighted sum-rate compared to using only straight beams.

2606.10048 2026-06-10 eess.SP New submission

Human Walking Sensing and Pose Estimation in the 6 GHz Band Using Amplitude and Phase CSI

Zhaorui Yin, Mattia Brambilla, Monica Nicoli

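AI summary: The paper studies indoor human pose estimation from the amplitude and phase CSI of 6 GHz OFDM signals, designs a processing pipeline, and adapts four deep learning models; experiments show that amplitude-only CSI performs comparably to joint amplitude-phase processing and that phase information is more useful as a complementary feature.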

Abstract

This paper investigates human pose estimation from Orthogonal Frequency-Division Multiplexing (OFDM) signals in an indoor multistatic wireless network operating in the 6 GHz band. We design and validate a processing pipeline that exploits both the amplitude and phase of the Channel State Information (CSI) from multiple radio links to estimate the human body pose. Four deep learning architectures from the literature, namely DT-Pose, MetaFi++, HPE-Li, and VST-Pose, are adapted to the OFDM CSI structure and extended to jointly exploit the amplitude and phase information. The models estimate the pose of a human walking within the network coverage area. Performance evaluation is conducted on an open-access dataset using standard pose-estimation metrics such as Procrustes-aligned Mean Per-Joint Position Error (PA-MPJPE) and Bone Length Loss (BLL). Results indicate that reliable human pose reconstruction can be achieved from 6 GHz OFDM CSI measurements, with DT-Pose providing the best overall accuracy. On average, amplitude-only CSI yields performance comparable to joint amplitude-phase processing, whereas phase information is more beneficial as a complementary feature rather than as a standalone input.

2606.11013 2026-06-10 stat.ME New submission

Empirical stratification for treatment effect heterogeneity with post-treatment variables

Chao Cheng, Rui Wang, Yichi Zhang

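AI summary: The paper proposes an assumption-lean empirical stratification framework that defines empirical scores from potential post-treatment variable responses predicted from baseline covariates, constructs identifiable empirical-stratum treatment effects, and connects them to principal causal effects.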

Abstract

Post-treatment variables (PVs), such as treatment noncompliance, behavioral responses, and intercurrent events, often modify the ultimate treatment effect on the primary outcome. However, existing methods provide limited tools for studying treatment effect heterogeneity with respect to PVs. Conventional heterogeneous treatment effect estimands condition on baseline covariates, but similarly conditioning on the observed PV can induce endogenous selection bias in treatment effect estimation. Principal stratification offers a rigorous framework for studying principal causal effects across principal strata, but principal strata are latent and their identification often requires stringent assumptions. This paper develops an assumption-lean empirical stratification framework for characterizing treatment effect heterogeneity with respect to PVs. We define empirical scores using the predicted potential PV responses based on baseline covariates, and use the empirical scores to construct empirically accessible subgroups. The resulting empirical-stratum treatment effects (ETEs) are identifiable under standard causal assumptions. We connect the proposed framework to principal stratification by showing that the average ETE recovers principal causal effects under the principal ignorability assumption, but remains informative under violations of this assumption. We further introduce projected ETE curves and develop efficient influence function-based estimators for semiparametric inference. We illustrate the proposed framework with two real-world applications.

2606.10969 2026-06-10 stat.ME 新提交

A Functional Data Framework For Analyzing Shapes and Textures in Images

图像形状与纹理分析的函数数据框架

Issam-Ali Moindjié

AI总结 提出一种基于函数数据分析的星形域图像表示方法,降低维度与计算成本,并应用于监督分类。

详情
AI中文摘要

图像表示由轮廓和纹理特征刻画的物体。从统计角度看,这些特征可定义为连续随机函数的观测。然而,大多数现有方法依赖于基于像素的离散化,导致高维表示和沉重的计算成本。本文介绍了一种更经济的替代表示。该表示假设物体具有星形域内部。在此条件下,我们从函数数据分析的角度探索图像分析。所提出的框架在真实数据监督图像分类问题上进行了说明。

英文摘要

Images represent objects characterized by contours and textures. From a statistical perspective, these features can be defined as observations of continuous random functions. However, most existing approaches rely on pixel-based discretizations, which lead to high-dimensional representations and heavy computational costs. In this note, we introduce an alternative, more frugal representation. This representation assumes that the object has a star-shaped domain interior. Under this condition, we explore the analysis of images from a functional data analysis perspective. The proposed framework is illustrated on a real-data supervised image classification problem.

2606.10866 2026-06-10 stat.ME stat.AP stat.CO 新提交

Adressing Separation: A Firth-corrected Joint Model for Longitudinal and Time-to-event Data with an Application on Dropout from Vocational Training

解决分离问题:纵向与时间-事件数据的Firth校正联合模型及其在职业培训辍学中的应用

Sophie Potts, Viola Deutscher, Elisabeth Bergherr

AI总结 针对联合模型中分类协变量分离导致估计偏差的问题,引入Firth校正到极大似然估计中,通过EM算法实现参数估计,模拟和实际数据表明该方法能降低偏差,并应用于德国职业培训辍学影响因素分析。

详情
AI中文摘要

纵向与时间-事件数据的联合模型常用于建模内生纵向协变量与时间-事件结局的关系。然而,该类模型继承了生存子模型的一些局限性,包括分类协变量每个类别必须非分离。因此,我们将Firth校正引入联合模型的频率学派估计过程,使模型类适用于存在分离情况的数据集。我们推导了校正项所需的量,并在联合模型的参数估计中将其实现于期望最大化算法。我们的模拟研究表明,在存在分离问题的数据情境下,Firth校正估计过程产生更少偏差的估计,且相应系数趋近于非分离情况下观察到的估计值。在关于职业培训满意度和辍学数据集上的应用展示了Firth校正联合模型在真实世界分离数据集中的优势。结果通过明确建模社会经济和培训特定因素对辍学风险的直接效应以及它们通过培训满意度的间接贡献,补充了德国职业培训辍学研究的文献。

英文摘要

Joint models for longitudinal and time-to-event data are frequently used to model endogenous longitudinal covariates alongside a time-to-event outcome. However, the model class inherits some limitations of the survival submodels, including the requirement of non-separation for each category of categorical covariates. We therefore incorporate Firth's correction into the frequentist estimation procedure of joint models in order to make the model class applicable in settings with separation cases. We derive the needed quantities for the correction term and implement it in the Expectation-Maximization algorithm for parameter estimation in joint models. Our simulation study shows that, in data situations with separation issues, the Firth-corrected estimation procedure yields less biased estimates and the respective coefficients approach the estimated values observed in the non-separation cases. The application to a data set on satisfaction with and dropout from vocational training demonstrates the advantages of the Firth-corrected joint model in a real-world data set with separation. The results add to the literature on dropout from vocational training in Germany by explicitly modeling direct effects of socioeconomic and training-specific factors on the risk of dropout as well as their indirect contribution via satisfaction with the training.

2606.10772 2026-06-10 stat.AP 新提交

Structural Under-Representation of Women in News: Nonparametric Bayesian Mixtures Capture Time-Dependent Dynamics

新闻中女性的结构性低代表性:非参数贝叶斯混合模型捕捉时间依赖动态

Isabella Habereder, Thomas Kneib, Isao Echizen, Timo Spinde

AI总结 采用时间依赖贝叶斯混合模型分析加拿大新闻数据,揭示女性引述比例在所有主题和地区中均存在结构性低代表性,且超过85%的时间序列未见改善。

详情
AI中文摘要

女性作为新闻媒体引用来源的低代表性是性别偏见的一种显著表现。理解性别偏见的集中区域及其演变方式对于有针对性的缓解至关重要。由于性别代表性随主题、时间和报道地区而变化,产生难以用参数化方法捕捉的复杂依赖关系,我们采用非参数模型来揭示潜在聚类结构和时间动态。我们将时间依赖贝叶斯混合建模技术与针对女性引述份额(介于0和1之间)的Beta混合核相结合。该模型拟合了2019年至2024年的加拿大新闻文章,揭示了所有聚类中女性的结构性低代表性,其中新闻主题对女性引述份额差异的影响比报道地区更强。超过85%的主题-地区时间序列在观察期内未显示向性别平等的改善。动态密度估计证实,女性引述份额的总体分布在2019年至2024年间保持稳定。我们的应用表明,高级概率模型不仅能复现性别偏见研究中的发现,还能揭示简单方法遗漏的潜在依赖关系和结构模式,鼓励未来采用基于模型的框架研究媒体偏见。

英文摘要

The under-representation of women as sources cited in news media is one prominent representation of gender bias. Understanding where gender bias concentrates and how it evolves is essential for targeted mitigation. Because gender representation varies across topics, time, and reported-on regions, creating complex dependencies that are difficult to capture parametrically, we employ a nonparametric model to uncover latent cluster structures and temporal dynamics. We combine time-dependent Bayesian mixture modeling techniques with a Beta mixture kernel tailored to female quote shares, bounded between 0 and 1. Fitted on Canadian news articles from 2019 to 2024, the model reveals structural under-representation of women across all clusters, with news topic driving differences in female quote shares more strongly than the reported-on region. More than 85% of topic-region time series show no improvement toward gender parity over the observation period. Dynamic density estimation confirms that the aggregate distribution of female quote shares remains stable between 2019 and 2024. Our application demonstrates that advanced probabilistic models not only reproduce findings in gender bias research but also reveal latent dependencies and structural patterns that simpler approaches miss, encouraging future adoption of model-based frameworks for studying media bias.

2606.10767 2026-06-10 stat.ME 新提交

Two-Sample Homogeneity Test via Entropic Optimal Transport

基于熵正则最优传输的两样本同质性检验

Yiming Ma, Hang Liu, Weiwei Zhuang

AI总结 提出基于熵正则最优传输映射的两样本同质性检验,利用平方L2距离作为统计量,证明可识别性、中心极限定理及局部渐近功效,并通过加权乘子自助法校准零分布。

详情
AI中文摘要

本文提出了一种基于熵正则最优传输(EOT)映射的两样本同质性检验,该映射来自一个共同的参考分布,即单位球上的均匀分布。检验统计量是两个经验EOT映射之间的平方$L^2$距离。对于固定的熵正则化参数,我们证明了总体映射差异是可识别的,推导了零假设下经验映射差异的函数型中心极限定理,并建立了高斯二次型零极限。我们还证明了对固定备择假设的一致性,并刻画了邻近备择假设下的局部渐近功效。提出了一种加权乘子自助法来校准非枢轴零分布,并证明了其有效性。大量模拟表明,所提出的EOT映射检验在有限样本下能可靠地控制检验水平,并且与其他现有方法相比具有竞争性的功效。该方法对于位置备择假设特别有效,并且除了单一的标量差异外,它还提供了关于两个分布如何不同的额外诊断信息。最后,本文以一个真实数据应用作结。

英文摘要

This paper proposes a two-sample homogeneity test based on entropic optimal transport (EOT) maps from a common reference distribution -- the uniform law on the unit ball. The test statistic is the squared $L^2$-distance between the two empirical EOT maps. For fixed entropic regularization parameter, we prove that the population map discrepancy is identifiable, derive a functional central limit theorem for the empirical map difference under the null, and establish the Gaussian quadratic-form null limit. We also prove consistency against fixed alternatives and characterize local asymptotic power under contiguous alternatives. A weighted multiplier bootstrap is proposed to calibrate the non-pivotal null distribution, and its validity is established. Extensive simulations demonstrate that the proposed EOT-map test has reliable finite-sample size control and exhibits competitive power compared with other existing methods. The method is particularly powerful for location alternatives and, beyond a single scalar discrepancy, it provides additional diagnostic information on how the two distributions differ. Finally, a real data application concludes the paper.

2606.10593 2026-06-10 stat.ME stat.CO 新提交

Data compression for fast dimension reduction and clustering of high-dimensional discrete data

面向高维离散数据的快速降维与聚类的数据压缩方法

Silvia D'Angelo, Michael Fop

AI总结 提出一种确定性降维框架,通过缩放位置编码的加权和将高维离散观测压缩为低维连续表示,保证单射性、近似高斯性及聚类中心可分离性,计算高效且适用于多种数据类型。

详情
AI中文摘要

高维离散数据出现在许多当代应用中,包括基因组学、微生物组研究、调查研究以及数字行为分析。对此类数据进行聚类仍然具有挑战性,因为现有方法通常计算要求高、对稀疏性和离散性敏感,或针对特定数据类型设计。我们提出了一种用于聚类高维离散观测的确定性降维框架。该方法通过由缩放位置编码定义的加权和,将每个观测压缩为低维连续表示,产生一种适用于二值、分类和计数数据的数值稳定变换。我们建立了所提出压缩的几个理论性质。该映射是单射的,确保不同的观测在压缩后保持不同。在温和的正则条件下,压缩变量近似服从高斯分布,为压缩空间中的基于模型的聚类提供了理论基础。我们进一步证明,聚类中心之间的分离度在压缩下得以保持,这意味着降维后位置驱动的聚类结构仍然可识别。广泛的模拟研究表明,在多种现实场景下聚类恢复准确。所提出的方法计算效率高,与常用于聚类的降维技术相比,速度显著提升。对爱尔兰婴儿名字记录和微生物组数据的应用进一步说明了其实用性。该框架提供了一种可扩展、计算高效且广泛适用的高维离散数据聚类方法。

英文摘要

High-dimensional discrete data arise in many contemporary applications, including genomics, microbiome research, survey studies, and digital behavioral analysis. Clustering such data remains challenging because existing methods are often computationally demanding, sensitive to sparsity and discreteness, or designed for specific data types. We propose a deterministic dimension-reduction framework for clustering high-dimensional discrete observations. The method compresses each observation into a low-dimensional continuous representation through weighted sums defined by a scaled positional encoding, yielding a numerically stable transformation applicable to binary, categorical, and count-valued data. We establish several theoretical properties of the proposed compression. The mapping is injective, ensuring that distinct observations remain distinct after compression. Under mild regularity conditions, the compressed variables admit an approximate Gaussian representation, providing a theoretical basis for model-based clustering in the compressed space. We further show that separation between cluster centroids is preserved under compression, implying that location-driven cluster structure remains identifiable after dimension reduction. Extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings. The proposed approach is also computationally efficient, providing substantial speed improvements over commonly used dimension-reduction techniques often used in conjunction with clustering. Applications to Irish baby-name records and microbiome data further illustrate its practical utility. The proposed framework offers a scalable, computationally efficient, and broadly applicable approach to clustering high-dimensional discrete data.

2606.10574 2026-06-10 stat.AP stat.ME 新提交

Two-stage imputation of longitudinal anthropometric data with cross-reference harmonisation: a simulation study

纵向人体测量数据的二阶段插补与交叉参考协调:一项模拟研究

Flavia Alves

AI总结 提出一种二阶段方法,通过线性插补和基于LMS方法的生长参考插补,解决纵向数据中缺失的人体测量值,并显式处理不同参考标准,模拟显示误差小且无偏。

详情
AI中文摘要

目标。纵向数据集经常缺失体重和身高测量值,而合并数据源的研究可能针对不同的生长参考标准(例如WHO参考和CDC图表)对测量值进行索引。我们描述并评估了一种可复现的二阶段方法,该方法在将参考标准的选择作为显式参数的同时,对缺失的人体测量数据进行插补。方法。阶段1在访视日期之间应用受试者内线性插值(仅内部间隙,无外推)。阶段2利用LMS方法,根据年龄和性别特异性生长参考插补剩余值:先估计每个受试者的百分位数,在受试者内向前和向后沿用该百分位数(受试者从未被测量时默认取第50百分位数),再从参考标准中读取访视年龄对应的期望值。可以为每个数据源提供不同的参考标准,以便记录和审计所应用的标准。我们通过掩盖并重新插补随机20%的观测值来评估恢复准确性。所有评估均使用计算机生成的合成数据。结果。在合成数据(n=60名受试者,288次访视,30%缺失)上,该方法补齐了全部缺失值,数据完整率达到100%。掩盖值恢复的体重平均绝对误差为1.78 kg(平均绝对百分比误差3.5%),身高为2.84 cm(2.0%),偏差可忽略。受试者内插值恢复的值比从生长参考恢复的值更准确,符合预期,支持二阶段顺序。结论。该方法提供了一种简单、无外部依赖且可审计的人体测量插补方法,显式处理不同的参考标准和每个值的来源。在用于实质性分析之前,下一步必要的工作是应用于实证数据并将插补不确定性传播到下游模型中。

英文摘要

Objective. Longitudinal datasets frequently contain missing weight and height measurements, and studies that combine data sources may index measurements against different growth reference standards (e.g., the WHO reference and CDC charts). We describe and evaluate a reproducible two-stage method that imputes missing anthropometry while making the choice of reference standard an explicit parameter. Methods. Stage 1 applies within-subject linear interpolation across visit dates (interior gaps only, no extrapolation). Stage 2 imputes remaining values from an age- and sex-specific growth reference using the LMS method by estimating each subject's centile, carrying it forward and backwards within the subject, defaulting to the 50th centile when a subject is never measured, and reading the expected value off the reference at the visit age. Different references can be supplied per data source so that the standard applied is recorded and auditable. We assessed recovery accuracy by masking and re-imputing a random 20% of observed values. All evaluations used computer-generated synthetic data. Results. On synthetic data (n = 60 subjects, 288 visits, 30% missing), the method resolved missingness to 100% completeness. Masked-value recovery gave a mean absolute error of 1.78 kg for weight (3.5% mean absolute percentage error) and 2.84 cm for height (2.0%), with negligible bias. Values recovered by within-subject interpolation were more accurate than those recovered from the growth reference, as expected, supporting the two-stage ordering. Conclusion. The method offers a simple, dependency-free, and auditable approach to anthropometric imputation, with explicit handling of differing reference standards and per-value provenance. Application to empirical data and propagation of imputation uncertainty into downstream models are the necessary next steps before use in substantive analyses.
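The two stages described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the author's code: the function names, the use of age as the time axis, and the LMS parameter values in the usage lines are assumptions made for the example. The conversions use the standard LMS formulas (Box-Cox power L, median M, coefficient of variation S).

```python
import math

def interpolate_interior(ages, values):
    """Stage 1 sketch: fill interior gaps by linear interpolation; never extrapolate."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (ages[i] - ages[left]) / (ages[right] - ages[left])
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

def lms_z(y, L, M, S):
    """Convert a measurement to a z-score under LMS reference parameters."""
    if L == 0:
        return math.log(y / M) / S
    return ((y / M) ** L - 1) / (L * S)

def lms_value(z, L, M, S):
    """Read the expected measurement for a given z-score off the reference."""
    if L == 0:
        return M * math.exp(S * z)
    return M * (1 + L * S * z) ** (1 / L)

# Hypothetical subject: weight (kg) missing at the first and third visits.
ages = [1.0, 1.5, 2.0, 3.0]
weights = [None, 10.0, None, 14.0]
stage1 = interpolate_interior(ages, weights)  # the leading gap stays None

# Stage 2 sketch: carry the subject's z-score backwards to the first visit.
# The (L, M, S) triples below are made-up values, not WHO or CDC entries.
z = lms_z(10.0, -0.3, 10.8, 0.12)
stage2_first_visit = lms_value(z, -0.2, 9.6, 0.12)
```

A subject who is never measured would use `z = 0`, which returns the reference median M and corresponds to the 50th-centile default described in the abstract.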

2606.10563 2026-06-10 stat.ME 新提交

Predicting Current Outcomes From Historical Survey Data With Weighted Conformal Prediction

基于加权共形预测从历史调查数据预测当前结果

Chihoon Lee, Sungkyu Jung, Hyokyung G. Hong

AI总结 针对大规模调查中部分结果仅在特定年份测量的缺失问题,提出加权共形预测框架,通过估计历史与目标协变量分布间的似然比,实现有效的总体水平预测,并保证覆盖概率。

Comments Submitted to Journal of the Royal Statistical Society Series B. 89 pages, 14 figures. Includes supplementary material

详情
AI中文摘要

在诸如国家健康与营养调查(NHANES)等大规模复杂调查中,某些结果仅在选定的年份进行测量,导致不同调查波次间记录不完整。我们开发了一个加权共形预测框架,能够利用早期调查的信息对未观测到的结果进行有效的总体水平预测。该方法适应协变量偏移,其中连续和分类协变量的分布随时间演变,同时调查设计影响代表性。它整合了子组特定的密度比和子组比例估计,以近似历史与目标协变量分布之间的似然比,并且我们为所得预测集建立了覆盖保证。模拟研究和一项预测当前美国人口低密度脂蛋白胆固醇(LDL-C)的应用表明,所提出的方法实现了接近名义水平的覆盖,并且在效率上优于现有方法,特别是在协变量分布复杂或未知的情况下。

英文摘要

In large-scale complex surveys such as the National Health and Nutrition Examination Survey (NHANES), some outcomes are measured only in selected years, leaving incomplete records across survey waves. We develop a weighted conformal prediction framework that enables valid population-level prediction of unobserved outcomes using information from earlier surveys. The method accommodates covariate shift, where both continuous and categorical covariate distributions evolve over time while survey design affects representativeness. It integrates subgroup-specific density ratio and subgroup-proportion estimation to approximate likelihood ratios between the historical and target covariate distributions, and we establish coverage guarantees for the resulting prediction sets. Simulation studies and an application predicting low-density lipoprotein cholesterol (LDL-C) for the current U.S. population show that the proposed approach achieves coverage close to the nominal level and improved efficiency over existing methods, particularly when covariate distributions are complex or unknown.
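The quantile step at the core of weighted conformal prediction can be sketched as below. This is the generic weighted conformal construction, not the authors' estimator: it assumes the likelihood-ratio weights have already been estimated, and the function name, scores, and weights are hypothetical.

```python
def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    Each calibration score carries its likelihood-ratio weight; the test
    point contributes test_weight as a point mass at +infinity.
    """
    total = sum(weights) + test_weight
    target = (1 - alpha) * total
    running = 0.0
    for score, weight in sorted(zip(scores, weights)):
        running += weight
        if running >= target:
            return score
    return float("inf")

# Hypothetical absolute-residual scores from historical survey data,
# with assumed likelihood-ratio weights toward the target population.
scores = [1.0, 2.0, 3.0, 4.0]
weights = [4.0, 1.0, 1.0, 1.0]
threshold = weighted_conformal_threshold(scores, weights, test_weight=1.0, alpha=0.3)

# Prediction interval around a hypothetical point prediction.
y_hat = 120.0
interval = (y_hat - threshold, y_hat + threshold)
```

When the weighted mass of the calibration scores never reaches the target, the threshold is infinite and the prediction set is the whole real line, which is how the construction stays valid when the test point's weight is large.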

2606.10409 2026-06-10 stat.ME 新提交

Robust Bayesian Predictive Model Selection using Bregman Divergence

使用Bregman散度的稳健贝叶斯预测模型选择

Jongwoo Choi, Neil A. Spencer, Dipak K. Dey

AI总结 针对基于对数得分的ELPD对异常值和尾部不匹配敏感的问题,提出基于Bregman散度的广义ELPD框架,通过β-散度族控制低密度观测影响,实现稳健模型选择。

详情
AI中文摘要

预测性贝叶斯模型比较通常依赖于留一法交叉验证准则,如期望对数预测密度(ELPD)。然而,由于ELPD基于对数得分,模型排名可能对异常值和尾部不匹配过于敏感。我们提出一个得分匹配的广义ELPD框架,用Bregman评分规则替换对数得分,既用于通过广义后验更新模型参数,也用于评估留一法预测效用。候选后验预测分布根据所选评分规则下的样本外效用进行排序,从而得到标准ELPD基于恰当评分规则(proper scoring rule)的直接推广。我们特别关注β-散度族,其中β控制预测比较对低密度观测的敏感性。在模型误设定下,该过程渐近选择预测分布与数据生成过程在所选Bregman散度下最接近的模型。模拟研究和微生物及法医数据应用表明,广义ELPD通过降低对低密度观测的敏感性可以改变所选模型。

英文摘要

Predictive Bayesian model comparison often relies on leave-one-out (LOO) cross-validation criteria such as the expected log predictive density (ELPD). However, model rankings can be overly sensitive to outliers and tail mismatch because ELPD is based on the log score. We propose a score-matched generalized ELPD framework that replaces the log score by a Bregman scoring rule to update model parameters through a generalized posterior and to evaluate LOO predictive utility. Candidate posterior predictive distributions are ranked by out-of-sample utility under the chosen scoring rule, yielding a direct proper-score generalization of standard ELPD. We focus especially on the $β$-divergence family, where $β$ controls the sensitivity of predictive comparison to low-density observations. Under model misspecification, the procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence. A simulation study and applications to microbial and forensic data show that the generalized ELPD can change the selected model through reduced sensitivity to low-density observations.

2606.10342 2026-06-10 stat.AP 新提交

Binomial Smoothing for Inventory and Information Control in Supply Chains

供应链中库存与信息控制的二项式平滑

Rene Caldentey, Avi Giloni, Clifford Hurvich, Prem Talwai, Yichen Zhang

AI总结 针对分散供应链中零售商订单平滑与上游预测的权衡,提出二项式平滑策略,在最小化制造商预测误差的同时保持可逆性,并实现常数因子近似最优。

Comments 59 pages, 7 figures, 4 tables

详情
AI中文摘要

在许多分散的供应链中,上游企业不直接观察市场需求,而是从订单流推断下游状况。因此,零售商的补货策略扮演双重角色:它管理库存补货并塑造上游预测可用的信息。这产生了一个基本权衡:更平滑的订单提高上游可预测性,但延迟对需求的响应可能增加下游库存成本。我们研究在一个由一个零售商和一个制造商组成的两层供应链中,当制造商根据零售商的订单历史预测未来订单时,零售商应如何最优地平滑需求。我们提出二项式平滑,一类补货策略,通过使用二项式权重将每个需求单位分散到有限时间范围内来实现延迟需求响应。该类策略可解释、易于校准且解析易处理。在满足温和正则条件的弱平稳高斯需求下,我们证明,对于任何固定平滑时间范围,在所有具有相同平滑程度的策略中,二项式策略最小化制造商的预测误差。它保持可逆性,因此制造商可以从观察到的订单中恢复需求历史。更一般地,二项式平滑相对于最优策略实现了常数因子近似保证。我们的结果产生更广泛的见解:补货策略的设计不应仅仅像传统牛鞭效应度量那样减少订单方差,而应减少订单的不可预测成分。精心设计的平滑可以提高供应链绩效并部分替代信息共享,为无需协作的协调提供具体机制。

英文摘要

In many decentralized supply chains, upstream firms do not observe market demand directly and instead infer downstream conditions from the order stream. A retailer's replenishment policy therefore plays a dual role: it governs inventory replenishment and shapes the information available for upstream forecasting. This creates a fundamental trade-off. Smoother orders improve upstream predictability, but delaying the response to demand can increase downstream inventory costs. We study how a retailer should optimally smooth demand in a two-tier supply chain with one retailer and one manufacturer when the manufacturer forecasts future orders from the retailer's order history. We propose Binomial Smoothing, a class of replenishment policies that implements delayed demand response by spreading each unit of demand over a finite horizon using binomial weights. The class is interpretable, easy to calibrate, and analytically tractable. Under weakly stationary Gaussian demand satisfying mild regularity conditions, we show that, for any fixed smoothing horizon, the Binomial policy minimizes the manufacturer's forecast error among all policies with the same degree of smoothing. It remains invertible, so the manufacturer can recover demand history from observed orders. More generally, Binomial Smoothing achieves a constant-factor approximation guarantee relative to an optimal policy. Our results yield a broader insight: replenishment policies should be designed not merely to reduce order variance, as in the traditional bullwhip measure, but to reduce the unpredictable component of orders. Carefully designed smoothing can improve supply-chain performance and partially substitute for information sharing, providing a concrete mechanism for coordination without collaboration.
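As a rough illustration of spreading each unit of demand over a finite horizon with binomial weights, the sketch below assumes symmetric weights C(n, k) / 2^n for k = 0..n and zero demand before the first period. Both choices, and the function names, are assumptions made for the example; the paper's exact parameterization may differ.

```python
from math import comb

def binomial_weights(horizon):
    """Symmetric binomial weights C(n, k) / 2**n for k = 0..n (assumed form)."""
    return [comb(horizon, k) / 2 ** horizon for k in range(horizon + 1)]

def smoothed_orders(demand, horizon):
    """Order in period t = sum_k w_k * demand[t - k], with zero demand before t = 0."""
    weights = binomial_weights(horizon)
    orders = []
    for t in range(len(demand)):
        orders.append(
            sum(w * demand[t - k] for k, w in enumerate(weights) if t - k >= 0)
        )
    return orders

# A single demand spike of 8 units is spread over three periods.
orders = smoothed_orders([8, 0, 0, 0], horizon=2)
```

Because the weights sum to one, the smoothed orders eventually replenish every unit of demand; a longer horizon flattens the order stream further at the cost of a slower response.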

2606.10224 2026-06-10 stat.ME stat.AP 新提交

Spatial Prediction of Local Soil Erosion Distribution in the Wasserstein Space

Wasserstein空间中局部土壤侵蚀分布的空间预测

Jiaming Qiu, Xiongtao Dai, Zhengyuan Zhu, Shuiqing Yin

AI总结 提出一种将局部侵蚀分布视为Wasserstein空间对象,通过基展开和多元随机场建模,结合局部回归和克里金法进行空间预测的新方法,在模拟和陕西省实际数据中优于现有方法。

Comments To appear in the Annals of Applied Statistics

详情
AI中文摘要

获取精确的侵蚀测量需要昂贵的实地工作,使得直接调查大范围区域(如省或流域)不可行。为了将实地结果扩展到如此广阔的区域,我们提出了一种新颖的空间预测方法,将局部侵蚀分布视为Wasserstein空间中的对象。这些分布被映射为平方可积轨迹,并通过基展开表示,形成捕捉空间依赖性的多元随机场。通过在这种表示中应用局部回归和克里金法,我们的方法灵活地建模和预测任意位置的侵蚀分布。该框架改进了对分布泛函(如均值和超越概率)的预测。模拟研究表明,所提出的方法优于错误指定的参数替代方法和现有的Fréchet回归方法。我们通过中国陕西省的详细侵蚀分析说明了该方法,其中将来自调查流域的局部测量结果扩展到使用土地利用和海拔等协变量预测整个省的侵蚀分布。

英文摘要

Obtaining precise erosion measurements requires costly fieldwork, making it infeasible to directly survey large domains such as a province or river basin. To extend fieldwork results across such extensive domains, we propose a novel spatial prediction method that treats local erosion distributions as objects in the Wasserstein space. These distributions are mapped into square-integrable trajectories and represented via basis expansion, forming a multivariate random field that captures spatial dependence. By applying local regression and Kriging in this representation, our approach flexibly models and predicts erosion distributions at arbitrary locations. This framework improves prediction for functionals of the distribution, such as the mean and exceedance probabilities. Simulation studies demonstrate that the proposed method outperforms a misspecified parametric alternative and existing Fréchet regression approaches. We illustrate the approach with a detailed erosion analysis in Shaanxi province, China, where local measurements from surveyed watersheds are extended to predict erosion distributions across the entire province using covariates such as land use and elevation.

2606.10123 2026-06-10 stat.ME 新提交

Methods for adjusting for covariate measurement error in flexible modelling of functional form: results of a blinded, controlled neutral comparison simulation study

在函数形式的灵活建模中调整协变量测量误差的方法:一项盲法、受控中性比较模拟研究的结果

Mohammed Sedki, Aris Perperoglou, Anne C. M. Thiébaut, Steve Ferreira Guerra, Paul Gustafson, Frank E. Harrell, Willi Sauerbrei, Michal Abrahamowicz, Laurence S. Freedman

AI总结 通过盲法多阶段中性比较模拟研究,评估了六类测量误差校正方法与四种灵活回归模型结合在非线性关联估计中的表现,发现点态SIMEX最准确稳健,贝叶斯方法和回归校准次之,多重插补较差,B样条最差。

详情
AI中文摘要

协变量测量误差在流行病学研究中普遍存在,并扭曲估计的暴露-结果关联,然而校正方法几乎仅在线性建模假设下研究。当潜在关联是非线性且本身通过灵活回归估计时,这些方法的行为仍不清楚。我们报告了一项在STRATOS倡议内进行的盲法、多阶段中性比较模拟研究,评估了测量误差校正与函数形式灵活建模的结合。六类校正方法(点态和基于系数的模拟外推[SIMEX]、logit尺度和风险尺度的贝叶斯推断、多重插补[MI]和回归校准[RC])分别与B样条(BS)、惩罚样条(PS)、分数多项式(FP)和自然样条(NS)结合,产生了23种分析方法。这些方法应用于在五种函数形式(J形、线性、两种阈值模型和饱和模型)下生成的病例对照数据,跨越不同样本量、重复子研究规模、误差幅度和误差分布的数据集,采用经典加性误差和用于误差校准的重复子研究。性能通过暴露分布中心95%范围内估计函数的对数均方误差进行评估。点态SIMEX总体最准确且最稳健,其次是贝叶斯方法和与PS、FP或NS配对的RC;MI表现较差,而使用无惩罚BS的贝叶斯估计表现最差。PS、FP和NS几乎等效,而BS始终较差。没有单一方法在所有场景中占主导地位,强调了敏感性分析的价值。

英文摘要

Covariate measurement error is pervasive in epidemiological research and distorts estimated exposure-outcome associations, yet correction methods have been studied almost exclusively under linear modelling assumptions. Their behaviour when the underlying association is non-linear and is itself estimated with flexible regression remains poorly characterised. We report a blinded, multi-stage neutral comparison simulation study, conducted within the STRATOS initiative, evaluating measurement error correction coupled with flexible modelling of functional form. Six families of correction methods (pointwise and coefficient-based Simulation Extrapolation [SIMEX], Bayesian inference on the logit and risk scales, Multiple Imputation [MI], and Regression Calibration [RC]) were each combined with B-splines (BS), penalised splines (PS), fractional polynomials (FP), and natural splines (NS), yielding 23 analytic methods. Methods were applied to case-control data generated under five functional forms (J-shape, linear, two threshold models, and saturation) across simulated datasets spanning varying sample sizes, replication substudy sizes, error magnitudes, and error distributions, with classical additive error and a replication substudy for error calibration. Performance was assessed by the log mean squared error of the estimated function over the central 95% of the exposure distribution. Pointwise SIMEX was the most accurate and most robust approach overall, followed by Bayesian methods and RC when paired with PS, FP, or NS; MI performed less well, and Bayesian estimation with unpenalised BS performed worst. PS, FP, and NS were near-equivalent, whereas BS was consistently inferior. No single method dominated across all scenarios, underscoring the value of sensitivity analyses.

2606.10096 2026-06-10 stat.ME 新提交

Estimating the Wasserstein barycenter of one-dimensional distributions under sparse sampling

稀疏采样下一维分布的Wasserstein重心估计

James Peng, Florian Stijven, Linbo Wang, Peter B. Gilbert

AI总结 针对每个单元仅通过少量独立同分布样本观测到一维分布的数据,提出边际构造重心(MCB)估计量,通过二项混合方法估计潜在分位数分布,克服稀疏采样下经验Wasserstein重心的偏差,并证明其一致性和渐近正态性。

详情
AI中文摘要

我们研究稀疏采样下的分布数据,其中每个单元由实直线上的概率分布表示,仅通过少量独立同分布样本观测。一维分布数据的一个自然中心趋势概念是Wasserstein重心,其分位数函数是单元级分位数函数的逐点平均。我们关注Wasserstein重心分位数函数的逐点估计:在给定分位数水平下,目标是相应单元级分位数的总体均值。一个朴素的代入(plug-in)估计量是经验Wasserstein重心,它将观测到的单元级经验分布视为真实的潜在单元级分布。然而,在稀疏采样下,该估计量可能存在严重偏差。我们提出了一种方法,既不直接估计单元级分布,也不直接估计分布的完整总体规律。我们从更宏大的目标开始:刻画给定分位数水平下潜在单元级分位数的分布。我们证明该分布可以用单元级CDF值的边际分布表示,而后者可以通过二项混合方法估计。由此得到我们的估计量,即边际构造重心(MCB)估计量,它通过取估计的潜在单元级分位数分布的均值得到。我们建立了MCB估计量逐点一致且渐近正态的条件,并通过模拟表明,在稀疏采样下它能够显著优于经验Wasserstein重心。我们在HVTN 502/503疫苗效力试验的HIV-1序列数据分析中说明了该方法,当每个参与者只有少量序列可用时,使用重心来总结和比较参与者内部病毒序列特征的分布。

英文摘要

We study distributional data under sparse sampling, where each unit is represented by a probability distribution on the real line observed only through a small i.i.d. sample. A natural notion of central tendency for one-dimensional distributional data is the Wasserstein barycenter, whose quantile function is the pointwise average of the unit-level quantile functions. We focus on pointwise estimation of the Wasserstein barycenter quantile function: at a given quantile level, the target is the population mean of the corresponding unit-level quantiles. A naive plug-in estimator is the empirical Wasserstein barycenter, which treats observed unit-level empirical distributions as the true latent unit-level distributions. Under sparse sampling, however, this estimator can be severely biased. We propose an approach that avoids directly estimating either the unit-level distributions or the full population law of distributions. We start with the more ambitious goal of characterizing the distribution of latent unit-level quantiles at a given quantile level. We show that this distribution can be written in terms of the marginal distributions of the unit-level CDF values, which can be estimated using binomial mixture methods. This motivates our estimator, the marginal-constructed barycenter (MCB) estimator, obtained by taking the mean of the estimated distribution of latent unit-level quantiles. We establish conditions under which the MCB estimator is pointwise consistent and asymptotically normal, and show through simulations that it can substantially outperform the empirical Wasserstein barycenter under sparse sampling. We illustrate the method in an analysis of HIV-1 sequence data from the HVTN 502/503 vaccine efficacy trials, using the barycenter to summarize and compare within-participant distributions of viral sequence features when only a small number of sequences are available per participant.
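For orientation, the naive plug-in baseline mentioned in this abstract (the empirical Wasserstein barycenter) can be sketched as follows. This is not the MCB estimator; the function names and the particular empirical-quantile convention (left-continuous inverse of the empirical CDF) are assumptions made for the example.

```python
import math

def empirical_quantile(sample, p):
    """Left-continuous inverse of the empirical CDF at level p in (0, 1]."""
    ordered = sorted(sample)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

def empirical_barycenter_quantile(samples, p):
    """Pointwise average of the unit-level empirical quantiles at level p."""
    return sum(empirical_quantile(s, p) for s in samples) / len(samples)

# Two hypothetical units, each observed through a very small sample.
samples = [[1.0, 3.0], [2.0, 4.0, 6.0]]
median_of_barycenter = empirical_barycenter_quantile(samples, 0.5)
```

With only a handful of draws per unit, each empirical quantile is a noisy and typically biased stand-in for the latent unit-level quantile, which is the problem the abstract's MCB estimator is designed to avoid.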

2606.10093 2026-06-10 stat.AP stat.ME 新提交

Predicting Hospitalization from a Whole-Person Health Score with Incomplete Electronic Health Records Data: A Case Study

从不完整的电子健康记录数据中的全人健康评分预测住院:一项案例研究

Grayson E. Weavil, Joseph Rigdon, Sarah C. Lotspeich

AI总结 本研究利用统计建模和机器学习,从不完整的电子健康记录中计算非稳态负荷指数(ALI),并评估其预测住院的能力,发现模式子模型方法在样本内表现最佳(AUC=0.73),但交叉验证效果较差(AUC=0.63)。

Comments 13 pages, 5 figures, 2 tables, R code and simulated dataset available on GitHub

详情
AI中文摘要

将标准化的全人健康测量嵌入电子健康记录(EHR)可能对预防性护理至关重要。非稳态负荷指数(ALI)由三个身体系统的十个压力源成分计算得出,提供了整体健康的有前景的快照。ALI可以从EHR数据计算,但许多成分缺失,因为并非所有患者都接受所有测试。使用统计建模和机器学习,来自大型学术健康系统的$1000$名患者的EHR数据被用于从ALI预测住院(作为计数或二元变量),并控制年龄和性别。评估了各种方法来填补患者缺失的ALI成分的信息空白,包括结合成分或单独使用它们的汇总度量。性能通过受试者工作特征(ROC)曲线和相应的ROC曲线下面积(AUC)来衡量。住院的计数建模并未优于二元建模,逻辑回归优于随机森林。总体而言,汇总度量表现相似,其中完整病例比例(即“不健康”的非缺失成分比例)表现最佳(AUC $= 0.64$),但差异$\leq 0.01$。当单独使用成分时,模式子模型方法在样本中最准确地预测了住院(AUC $= 0.73$),但交叉验证效果不佳(AUC $= 0.63$)。所有汇总度量表现相似。然而,当单独包含ALI成分时,为具有相同缺失数据模式的患者子集定制模型表现最佳。下一步包括在EHR中落地实施,以实现预测并大规模支持临床医生的决策。

英文摘要

Embedding a standardized whole-person health measure in electronic health records (EHR) could be instrumental to preventative care. The allostatic load index (ALI), calculated from ten component stressors across three body systems, offers a promising snapshot of holistic health. The ALI can be calculated from EHR data, but many components are missing, since not all patients undergo all tests. Using statistical modeling and machine learning, EHR data for $1000$ patients from a large academic health system were used to predict in-patient hospitalization (as a count or binary) from ALI, controlling for age and sex. Various methods were evaluated to fill in information gaps for patients' missing ALI components, including summary measures combining components or using them separately. Performance was measured using receiver operating characteristic (ROC) curves and corresponding areas under the ROC curve (AUC). Count modeling of hospitalization did not improve upon binary, and logistic regression beat random forest. Overall, summary measures performed similarly, with the complete-case proportion (i.e., the proportion of non-missing components that were "unhealthy") performing best (AUC $= 0.64$) but by $\leq 0.01$. When using components separately, the pattern submodel approach most accurately predicted hospitalization (AUC $= 0.73$) in sample, but did not cross-validate as well (AUC $= 0.63$). All summary measures performed similarly. However, when including the ALI components separately, tailoring models to subsets of patients with the same missing data pattern performed best. Next steps include EHR implementation to enable prediction and support clinician decision-making at scale.
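The complete-case proportion defined in this abstract (the share of non-missing components that are "unhealthy") can be computed as in the sketch below. The function name and the component labels are placeholders, since the abstract does not list the ten ALI components, and returning `None` when every component is missing is an assumption.

```python
def complete_case_proportion(components):
    """Proportion of non-missing components flagged unhealthy.

    components maps a component name to True (unhealthy), False (healthy),
    or None (not measured). Returns None when nothing was measured.
    """
    observed = [flag for flag in components.values() if flag is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)

# Hypothetical patient with four of ten components shown; one is missing.
patient = {
    "component_1": True,
    "component_2": False,
    "component_3": None,
    "component_4": True,
}
score = complete_case_proportion(patient)
```

Dividing by the number of observed components, rather than by ten, keeps patients with few tests on the same 0 to 1 scale as fully tested patients.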