A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data
Zhuphua Cao
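AI summary: Proposes the FDRS framework, which combines statistical and machine-learning methods to detect non-random digit patterns in numerical data. Validated on an enzymatic absorbance dataset and simulated irregular data, it can effectively stratify risk into grades.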
Details
Raw numerical datasets remain less systematically examined in integrity screening than images, plagiarism, or summary-statistic inconsistencies. We developed the Fabrication-risk Digit Randomness Screening model (FDRS), a statistical and machine-learning framework for detecting non-random digit-pattern irregularities in numerical research data. FDRS integrates single- and joint-decimal-digit tests, Cramér's V, entropy metrics, Kullback-Leibler (KL) divergence, digit-preference indices, progressive subsampling, and semi-supervised risk scoring. It was evaluated using an instrument-derived enzymatic absorbance dataset (RawData, n=253) and a blinded, manually simulated irregular dataset (ErrData, n=255). RawData showed no significant deviation in single third-decimal-digit analysis, whereas ErrData showed a significant deviation. In joint third-fourth decimal digit analysis, ErrData showed higher Cramér's V, lower normalized entropy, higher KL divergence, and a more persistent progressive-subsampling deviation signal. In internal validation, Elastic-net Logistic Regression achieved the highest AUC (0.98395) and lowest Brier score (0.048439), while Random Forest achieved the highest accuracy (0.926667) and balanced accuracy (0.935). RawData received a low ensemble risk score of 0.124627 and was classified as Grade 0; ErrData received a score of 0.740760 and was classified as Grade 3. External real-world benchmarks supported graded risk stratification: three datasets without identified public post-publication concerns were classified as Grade 0 or 1, whereas two datasets from publicly questioned or institutionally handled articles were classified as Grade 2 or 3. FDRS can prioritize raw numerical datasets for further review by integrating interpretable statistical and machine-learning features. It is an auxiliary digit-structure screening tool, not standalone evidence of fabrication or misconduct.
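As a rough illustration of the kind of digit-level features named in the abstract, the following minimal sketch computes a chi-square statistic against a uniform digit distribution, normalized entropy, KL divergence from uniform, and Cramér's V between two decimal positions. It is not the FDRS implementation: the function names, the uniform reference distribution, the reading of Cramér's V as an association between third and fourth digits, and the requirement that readings arrive as plain decimal strings are all assumptions made for this example.

```python
import math
from collections import Counter

# Chi-square critical value for 9 degrees of freedom at alpha = 0.05.
CHI2_CRITICAL_DF9 = 16.919


def decimal_digit(value: str, position: int) -> int:
    """Return the digit at the 1-based decimal place of a plain decimal string.

    Assumption: values are kept as recorded strings so trailing zeros survive.
    """
    _, _, fraction = value.partition(".")
    if len(fraction) < position or fraction[position - 1] not in "0123456789":
        raise ValueError(f"{value!r} has no digit at decimal place {position}")
    return int(fraction[position - 1])


def digit_proportions(digits):
    """Observed proportion of each digit 0-9."""
    if not digits:
        raise ValueError("need at least one digit")
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(10)]


def chi_square_uniform(digits):
    """Pearson chi-square statistic against equal expected counts (df = 9)."""
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def normalized_entropy(digits):
    """Shannon entropy divided by log(10): 1.0 is uniform, 0.0 is one digit only."""
    props = digit_proportions(digits)
    return sum(p * math.log(1 / p) for p in props if p > 0) / math.log(10)


def kl_from_uniform(digits):
    """KL divergence (in nats) of the observed digits from the uniform distribution."""
    props = digit_proportions(digits)
    return sum(p * math.log(p / 0.1) for p in props if p > 0)


def cramers_v(first, second):
    """Cramér's V for the association between two paired digit sequences."""
    n = len(first)
    joint = Counter(zip(first, second))
    rows, cols = Counter(first), Counter(second)
    chi2 = 0.0
    for r, row_total in rows.items():
        for c, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


if __name__ == "__main__":
    # Made-up readings for illustration only; they are not values from the study,
    # and six values are far too few for a meaningful test.
    readings = ["0.1234", "0.2375", "0.3125", "0.4875", "0.5625", "0.6375"]
    third = [decimal_digit(r, 3) for r in readings]
    fourth = [decimal_digit(r, 4) for r in readings]
    stat = chi_square_uniform(third)
    print("chi-square:", stat, "exceeds critical value:", stat > CHI2_CRITICAL_DF9)
    print("normalized entropy:", normalized_entropy(third))
    print("KL from uniform:", kl_from_uniform(third))
    print("Cramér's V (3rd vs 4th):", cramers_v(third, fourth))
```

In a sketch like this, a larger chi-square statistic, lower normalized entropy, higher KL divergence, and higher Cramér's V all point in the direction the abstract reports for ErrData. How FDRS actually combines such features into a graded risk score is not specified in the abstract and is not reproduced here.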