
Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction

Fig 1

The procedures of ATGO for protein function prediction.

(a) The workflow of ATGO. Starting from the input sequence, the ESM-1b transformer generates feature embeddings from its last three layers, which are fused by a fully connected neural network. The fused feature embedding is then fed into a triplet network to produce confidence scores for GO terms. (b) The structure of the ESM-1b transformer. For an input sequence, masking, one-hot encoding, and position embedding are executed in order to generate the coding matrix, which is then fed into a self-attention block with n layers. Each layer outputs a feature embedding matrix from an individual evolutionary view by integrating m attention heads with a feed-forward network, where scaled dot-product attention is performed in each head. (c) The design of the triplet network for assessing feature similarity. The input is a triplet (anc, pos, neg), where anc is an anchor (baseline) protein, and pos (or neg) is a positive (or negative) protein with the same function as (or a different function from) anc. Each sequence is fed into the designed feature generation model to extract a feature embedding vector, which serves as the input of a fully connected layer that outputs a new embedding vector. The feature dissimilarity between two proteins is then measured by the Euclidean distance between their embedding vectors. Finally, a triplet loss is designed to strengthen the relationship between functional similarity and feature similarity in the embedding space.
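The scaled dot-product attention performed in each head (panel b) is the standard formulation softmax(QK&#x2E;K^T/&#x221A;d_k)V. A minimal NumPy sketch, with the query/key/value matrices here being toy placeholders rather than values from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # pairwise query-key similarities
    # numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # attention-weighted mix of values

# toy example: 2 positions with 2-dimensional query/key/value vectors
Q = K = V = np.eye(2)
out = scaled_dot_product_attention(Q, K, V)
```

In a multi-head block like ESM-1b's, this computation runs independently in each of the m heads on learned projections of the input, and the head outputs are concatenated before the feed-forward network.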
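The triplet loss in panel (c) is described as enforcing that functionally similar proteins lie closer in embedding space than functionally different ones, with dissimilarity measured by Euclidean distance. A minimal sketch of a standard margin-based triplet loss consistent with that description; the margin value and toy embeddings are assumptions, not values from the paper:

```python
import numpy as np

def triplet_loss(anc, pos, neg, margin=1.0):
    """Hinge-style triplet loss: push d(anc, pos) below d(anc, neg) by `margin`."""
    d_pos = np.linalg.norm(anc - pos)  # Euclidean distance anchor vs. positive
    d_neg = np.linalg.norm(anc - neg)  # Euclidean distance anchor vs. negative
    return max(d_pos - d_neg + margin, 0.0)

# toy embeddings: the positive lies near the anchor, the negative far away
anc = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])
neg = np.array([3.0, 0.0])
print(triplet_loss(anc, pos, neg))  # → 0.0 (margin constraint already satisfied)
```

During training, minimizing this loss over many (anc, pos, neg) triplets drives the fully connected layer to produce embeddings in which Euclidean distance tracks functional dissimilarity.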

doi: https://doi.org/10.1371/journal.pcbi.1010793.g001