GENEB: Why Genomic Models Are Hard to Compare
Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov
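AI summary: To address the fragmented evaluation of genomic foundation models, the authors propose the GENEB benchmark, which compares 40 models on 100 tasks under a unified probing protocol and reports key findings such as unstable model rankings and limited gains from scale.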
- Comments
- change first page figure, fix model sizes, add more consistency
Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
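To make the idea of probing frozen representations in a few-shot regime concrete, the following is a minimal sketch and not GENEB's actual pipeline. It assumes a closed-form ridge probe on one-hot targets as a stand-in for whatever probe the benchmark uses, and it replaces real model embeddings with synthetic Gaussian clusters. All function names, the regularization strength, and the number of shots per class are hypothetical choices made for illustration.

```python
import numpy as np


def few_shot_indices(labels, k, rng):
    """Return indices of k randomly chosen examples per class."""
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=k, replace=False)
        for c in np.unique(labels)
    ]
    return np.concatenate(picked)


def add_bias(X):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_ridge_probe(X, y, n_classes, l2=1.0):
    """Fit a linear probe by ridge regression onto one-hot targets.

    Only the probe weights are learned; the embeddings X are treated as
    fixed outputs of a frozen encoder.
    """
    Xb = add_bias(X)
    Y = np.eye(n_classes)[y]
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Classify by the largest probe output and return the accuracy."""
    pred = np.argmax(add_bias(X) @ W, axis=1)
    return float(np.mean(pred == y))


# Synthetic stand-in for frozen embeddings (assumption: NOT real model output).
rng = np.random.default_rng(0)
n, dim, n_classes, k = 400, 16, 2, 8
labels = rng.integers(0, n_classes, size=n)
class_means = np.array([-np.ones(dim), np.ones(dim)])
embeddings = class_means[labels] + rng.normal(size=(n, dim))

# Few-shot split: k labelled examples per class for training, the rest held out.
train_idx = few_shot_indices(labels, k, rng)
test_mask = np.ones(n, dtype=bool)
test_mask[train_idx] = False

W = fit_ridge_probe(embeddings[train_idx], labels[train_idx], n_classes)
acc = probe_accuracy(W, embeddings[test_mask], labels[test_mask])
print(f"few-shot probe accuracy: {acc:.3f}")
```

In a real comparison, the same probe and the same few-shot split would be applied to embeddings from each model on each task, so that differences in score reflect the representations and not the evaluation procedure.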