ChainzRule: Sample-Efficient, Robust Deep Learning Across Tabular, NLP, and Vision Tasks
Rowan Martnishn
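AI summary: Proposes the ChainzRule architecture, which replaces activation functions with learnable polynomial layers combined with differential regularization. By bounding intermediate derivatives, it obtains low-frequency, structurally stable representations and achieves better performance across multiple domains with less data and at standard inference cost.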
Production deep learning systems across enterprise domains operate under constraints that academic benchmarks routinely obscure: labeled data is expensive, inference budgets are tight, and models that cannot explain their behavior are difficult to trust and maintain. We present ChainzRule (CR), a neural architecture that replaces typical activation functions with learnable polynomial layers governed by Differential Regularization (DREG), a layer-wise Jacobian penalty computed analytically during the forward pass at standard inference cost. The core claim is that bounding intermediate derivatives forces the network toward low-frequency, structurally stable representations, which simultaneously reduces dependence on labeled data volume, improves robustness to distribution shift, and provides a measurable, gradient-based handle on model behavior. Evaluated across five domains, CR achieves $85.71\% \pm 2.01\%$ on Pima Diabetes (statistically superior to SVM and XGBoost); $46.20\% \pm 0.37\%$ on SST-5 sentiment classification with a frozen encoder (superior to RNTN while using approximately 5\% of its training data); $55.79\%$ on SST-5 with a fine-tuned BERT backbone (versus $54.9\%$ for a BERT-base linear head); $70.17\%$ on Yelp Full ordinal regression with 3.2M parameters (versus a 10-model average of $66.35\%$); and a $+2.32\%$ improvement in mean corruption accuracy on CIFAR-10-C. All reported $p$-values fall below the $\alpha = 0.05$ threshold after Bonferroni correction. Across every data fraction, CR maintains a gradient tail ratio $\tau$ (p99/mean) of $1.01$--$1.02$, against $1.07$--$1.09$ for all typical activation function baselines, a structural invariant we propose as the mechanistic driver of sample efficiency and a deployment-time proxy for model reliability.
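The two quantities the abstract relies on, a derivative penalty obtained during the forward pass and the gradient tail ratio $\tau$, can be illustrated with a minimal sketch. The abstract does not give the exact form of the polynomial layers or of DREG, so everything below is an assumption made for illustration: an elementwise polynomial with coefficients `coeffs`, a mean-squared-derivative penalty, a nearest-rank 99th percentile, and the hypothetical function names `poly_forward`, `layer_forward`, and `tail_ratio`.

```python
import math


def poly_forward(x, coeffs):
    """Evaluate p(x) = sum_k coeffs[k] * x**k together with p'(x).

    Horner's rule yields the value and the derivative in a single pass,
    which is one way a derivative term can be computed analytically
    alongside the forward computation (assumed form, not the paper's).
    """
    value = 0.0
    deriv = 0.0
    for c in reversed(coeffs):
        deriv = deriv * x + value  # derivative recurrence uses the old value
        value = value * x + c
    return value, deriv


def layer_forward(xs, coeffs):
    """Apply the polynomial elementwise and return (outputs, penalty).

    The penalty here is the mean squared derivative over the inputs;
    this is an assumed stand-in for a layer-wise Jacobian penalty.
    """
    outputs, derivs = [], []
    for x in xs:
        v, d = poly_forward(x, coeffs)
        outputs.append(v)
        derivs.append(d)
    penalty = sum(d * d for d in derivs) / len(derivs)
    return outputs, penalty


def tail_ratio(grad_magnitudes, q=0.99):
    """Ratio of the q-th quantile (nearest-rank) to the mean.

    With q = 0.99 this matches the abstract's description of tau as
    p99/mean; the percentile convention is an assumption.
    """
    ordered = sorted(grad_magnitudes)
    rank = max(1, math.ceil(q * len(ordered)))
    mean = sum(ordered) / len(ordered)
    return ordered[rank - 1] / mean


if __name__ == "__main__":
    # Hypothetical coefficients for p(x) = 0.1 + x + 0.05 * x^2.
    outs, pen = layer_forward([-1.0, 0.0, 0.5, 2.0], [0.1, 1.0, 0.05])
    print("outputs:", outs)
    print("derivative penalty:", pen)
    print("tail ratio:", tail_ratio([abs(o) for o in outs]))
```

In a training loop, a penalty of this kind would typically be added to the task loss with a weighting coefficient. A tail ratio close to 1 means the largest gradient magnitudes sit near the average, which is the property the abstract reports for CR.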