PROTOCOL: Late Interaction Retrieval for Protein Homolog Search
Gabrielle Cohn, Rohan Gumaste, Minh Hoang, Vihan Lakshman
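AI summary: Proposes ProtoCol, a model that applies ColBERT-style late interaction, scoring residue embeddings by maximum similarity (MaxSim), to improve the sensitivity of remote homolog search. It outperforms several baseline methods on SCOPe superfamily and Pfam clan benchmarks.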
Details
Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models (PLMs) provide context-aware representations that could improve alignment sensitivity in this regime. However, prior embedding-based protein retrieval pipelines often pool these representations into a single vector, potentially obscuring the local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model that represents proteins as sets of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, so candidate representations can be precomputed, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled-PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.
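The MaxSim scoring step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard ColBERT convention: each query residue embedding is matched to its most similar candidate residue embedding by cosine similarity, and the per-residue maxima are summed. The abstract does not say whether ProtoCol sums, averages, or otherwise normalizes these maxima, so the aggregation here is an assumption. The function names `l2_normalize`, `maxsim_score`, and `rank_candidates` are hypothetical, and the residue encoder is left out.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def maxsim_score(query_emb, cand_emb):
    """ColBERT-style late-interaction score between two proteins.

    query_emb: (Lq, d) array, one embedding per query residue.
    cand_emb:  (Lc, d) array, one embedding per candidate residue.
    Assumption: the per-residue maxima are summed, as in ColBERT.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    c = l2_normalize(np.asarray(cand_emb, dtype=np.float64))
    sim = q @ c.T  # (Lq, Lc) matrix of cosine similarities
    # For each query residue keep its best-matching candidate residue, then sum.
    return float(sim.max(axis=1).sum())


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted by descending MaxSim score, plus the scores."""
    scores = [maxsim_score(query_emb, c) for c in candidate_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Because each protein is encoded on its own, the candidate arrays passed to `rank_candidates` can be computed once and stored. Only the query embedding and the similarity matrix need to be computed at search time.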