SNP DISTRIBUTED REPRESENTATION USING ENTITY EMBEDDING

Single Nucleotide Polymorphism (SNP) is the most common type of genetic variation used to detect specific traits in organisms. Each SNP is located at a specific locus of a DNA sequence. At the time this study was conducted, how best to represent SNPs for machine learning models remained an open question. Building on previous works, we present a comparative study of distributed representation methods on SNP data. This study used 1,232 SNPs from the genomic data of 687 Indonesian rice samples collected from four distinct rice fields. The SNP data was converted into an encoded format. Entity embedding (Embedder) and three comparative models, i.e., Node2Vec, Struc2Vec, and LINE, were used to predict the rice yield from the SNP data. Entity embedding using the Embedder outperformed the comparative methods, with best R2 and MSE scores of 0.9368 and 0.2425, respectively.

Biologically, an SNP is located at a specific position (locus) of a DNA sequence [6]-[8]. The polymorphism (which means "many forms"), or variation, at each locus differs across the population.
Therefore, SNP data cannot be treated the same as a DNA sequence, even though the SNP itself is a unique part of DNA. A high-performance data storage system, as well as an appropriate way of processing the data, is required due to the vast volume of SNP data [9]. The data used in this study was genomic data from 687 Indonesian rice samples taken from four different rice fields, with a total of 1,232 SNPs. This resulted in a large number of SNP combinations for each rice sample. Hence, the sparseness of the data is tackled with a representation technique that enhances the performance of the model predicting rice yield from the rice SNP data.
At the time this study was conducted, how best to represent SNPs for machine learning models remained an open question. In language modeling (LM), there are localist and distributed representations.
The localist representation (LR), such as one-hot encoding for discrete random variables [10], suffers from long sparse vectors: it requires n bits to represent a corpus with n words. On the other hand, distributed representations (DR) [11] are denser, as the embedding dimension can be set freely regardless of the number of words n. DR is also context-dependent [11], contains continuous values instead of binary ones, and can generalize well [12] to new, unseen data.
Consequently, this research used DR as the representation technique.
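The contrast between the two representations can be illustrated with a minimal sketch. The vocabulary size, the embedding dimension, and the randomly initialised table below are hypothetical values chosen for illustration only; in a trained model the table entries would be learned weights.

```python
import random

def one_hot(index, n):
    """Localist representation: an n-long vector holding a single 1."""
    vec = [0.0] * n
    vec[index] = 1.0
    return vec

def embed(index, table):
    """Distributed representation: look up one dense row of the table."""
    return table[index]

n_words = 10        # hypothetical vocabulary size
embedding_dim = 3   # hypothetical; can be chosen independently of n_words

rng = random.Random(0)
# Randomly initialised table (assumption: stands in for learned weights).
table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(n_words)]

word = 7
sparse = one_hot(word, n_words)   # length grows with the vocabulary
dense = embed(word, table)        # length stays at embedding_dim

# The lookup is equivalent to multiplying the one-hot vector by the table.
product = [sum(sparse[i] * table[i][j] for i in range(n_words))
           for j in range(embedding_dim)]
```

The last step shows why an embedding layer can be read as a linear layer placed on top of a one-hot input, while avoiding the need to materialise the sparse vector.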
There are existing works regarding DR in LM, such as entity embedding. According to Cheng Guo and Felix Berkhahn [13], entity embedding can produce a model that generalizes well when the data is sparse. Entity embedding has three advantages: reduced memory usage, faster encoding than one-hot encoding, and the ability to reveal intrinsic properties of a categorical variable by mapping it into an embedding space [13]. In their experiments, this improved the prediction performance of neural networks, reducing the Mean Absolute Percentage Error (MAPE) from 0.101 to 0.093. Additionally, Feng Hou et al. [14] [15]. The proposed forecasting model achieved a MAPE within the range of 10 to 15%. Based on these previous works, this research proposes a distributed representation of SNP data using an entity embedding technique to predict rice yield from the SNP data. Other studies that used entity embedding can be found in [16], [17].
Graph embedding methods can also be used to create distributed representations. Examples are DeepWalk, Large-scale Information Network Embedding (LINE), Node2Vec, Structural Deep Network Embedding (SDNE), and Struc2Vec. The first graph embedding method, DeepWalk, proposed by Perozzi et al. [18], is an unsupervised feature learning method that learns latent representations of the vertices in a graph by treating truncated random walks as the equivalent of sentences. It was evaluated on multi-label classification tasks for social network data such as BlogCatalog, Flickr, and YouTube. DeepWalk achieved F1 scores up to 10% higher than competing methods and outperformed them while using less training data [18]. The second method, LINE, was introduced by Tang et al. [19] to address a limitation of existing graph embedding methods, which do not scale to real-world networks containing millions of nodes. According to Tang et al. [19], LINE is very efficient when embedding a network with billions of edges. On Wikipedia data, it improved the overall embedding effectiveness from 43.65% with DeepWalk to 66.10% with LINE, with approximately 14 hours less training time than DeepWalk. Node2Vec, the third method, proposed by Grover and Leskovec [20], learns continuous feature representations for the nodes of a graph and presents a more efficient approach to multi-label classification. According to Grover and Leskovec [20], Node2Vec achieved a gain of up to 21.8% in macro-F1 score on Wikipedia data over previous graph embedding methods such as DeepWalk and LINE. SDNE was then introduced by Wang et al. [21] as a semi-supervised deep model for graph or network embedding. According to Wang et al. [21], SDNE achieved better F1 scores on multi-label classification tasks using BlogCatalog, Flickr, and YouTube data. The last method, Struc2Vec, proposed by Ribeiro et al. [22], has the advantage over other graph embedding methods of capturing the structural identity of nodes in a graph or network. Other than Tang et al. [19], there are a few other methods that used 2Vec, such as [23]-[27].
FERANO, SETYONO, SISWANTO, DOMINIC, PARDAMEAN
In this study, entity embedding is proposed for SNP embedding. Entity embedding was chosen because it has several advantages. When the input is sparse and its statistics are uncertain, entity embedding helps the neural network generalize more effectively.
It can also reduce memory usage and encode faster than one-hot encoding. This study employed 1,232 SNPs from the genomic data of 687 Indonesian rice samples collected from four distinct rice fields. This resulted in a large number of SNP combinations for each rice sample.
The rest of this paper is organized as follows. Section 2 lays out the research framework. Section 3 presents the research results by comparing multiple embedding techniques. The last section concludes the findings and suggests possible future work in this field.

OVERVIEW
Single nucleotide polymorphisms (SNPs) are genetic variations that occur at a single nucleotide in the DNA sequences of living organisms. SNPs occur at various nucleotide positions scattered throughout the DNA. This study uses SNP data given as nucleotide base pairs in rice genomic data. These base pairs form the alleles in the DNA sequence. Therefore, the data must first be converted into an encoded format suitable for embedding. In this paper, a comparative study is proposed to assess the performance of different distributed representation methods, i.e., Embedder, Node2Vec, Struc2Vec, and LINE, using 1,232 Indonesian rice SNPs.

DATA PREPROCESSING
The nucleotide base pair in SNPs data needs to be encoded first. Table 1 shows the encoding method used on the SNPs data.
The SNP encoding carried out in this study resulted in three classes or categories of SNPs, namely homozygous major (reference), heterozygous, and homozygous minor (alternate), encoded as 0, 1, and 2, respectively. The encoding process considers the presence of alternate alleles in the data. If there is no alternate allele, the data is categorized as homozygous major and encoded as 0. If an alternate allele is present in only one of allele 1 and allele 2, the data is categorized as heterozygous and encoded as 1. If both alleles are alternate, the data is categorized as homozygous minor and encoded as 2. This transforms the biological notation of the SNP data into an encoded format suitable for embedding. The encoding process resulted in n 0-1-2 SNP sequences, where n is the number of samples in the dataset.
The Embedder trains a feedforward neural network with two hidden layers on the prepared rice genomic SNP data.
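The 0-1-2 encoding rules can be sketched in a few lines. The input layout is an assumption, since the raw data format is not reproduced here: each genotype is taken to be a pair of allele letters, and the reference allele of each SNP is taken to be known. The function names are hypothetical.

```python
def encode_genotype(allele_1, allele_2, reference):
    """Return 0 (homozygous major), 1 (heterozygous), or 2 (homozygous minor)."""
    # Count how many of the two alleles differ from the reference allele.
    return int(allele_1 != reference) + int(allele_2 != reference)

def encode_sample(genotypes, references):
    """Encode one sample: a list of allele pairs, one pair per SNP."""
    return [encode_genotype(a1, a2, ref)
            for (a1, a2), ref in zip(genotypes, references)]

# Hypothetical toy data: three SNPs with their reference alleles.
references = ["A", "C", "G"]
sample = [("A", "A"),   # no alternate allele    -> 0
          ("C", "T"),   # one alternate allele   -> 1
          ("T", "T")]   # both alleles alternate -> 2

encoded = encode_sample(sample, references)
print(encoded)  # [0, 1, 2]
```

Applying `encode_sample` to every sample yields the n encoded SNP sequences, one per sample.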
In this study, the Embedder was provided with two columns from the rice genomic SNP data: (1) rice yield, which contains a numerical value for each rice sample, and (2) SNPs, which contains the SNP sequences. Before using the Embedder, the data was split into three sets, namely training, validation, and testing data, with a ratio of 70%, 15%, and 15%, respectively. Each set then went through the same series of further preprocessing steps. Since the fourth step of the Embedder requires an embedding dimension to be provided, this study set two values of max_dim, i.e., 10 and 50, for hyperparameter tuning.
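A 70/15/15 split of the 687 samples could look like the sketch below. The shuffling, the seed, and the rounding rule (the remainder goes to the test set) are assumptions; the splitting procedure is not specified beyond the ratios.

```python
import random

def split_indices(n_samples, train=0.70, val=0.15, seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # fixed seed for reproducibility
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    # Whatever remains after the first two cuts becomes the test set.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(687)
print(len(train_idx), len(val_idx), len(test_idx))  # 480 103 104
```

Splitting by index keeps the yield column and the SNP column aligned, because the same index lists select rows from both.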

COMPARATIVE STUDY
Three other methods were also compared with the Embedder, namely Node2Vec, Struc2Vec, and LINE. All three are graph embedding (GE) methods. Cai et al. [28] categorize the output of GE into node embedding, edge embedding, hybrid embedding, and whole-graph embedding.
To produce an embedding from graph input, Cai et al. [28] describe five techniques that can be utilized, namely Matrix Factorization, Deep Learning, Edge Reconstruction, Graph Kernel, and Generative Model. According to Cai et al. [28], GE addresses graph analytics problems such as high computation and space cost by converting graph data into a low-dimensional space while still preserving the structural information and properties of the graph.
Node2Vec, Struc2Vec, and LINE require a preprocessing step. Since these methods represent the SNP data as a graph consisting of vertices and edges, a proper criterion for building the edges must be determined. Therefore, the correlation between each pair of SNP sequences was calculated. The calculation produced floating-point correlation scores indicating the degree of correlation between one SNP sequence and another. These scores were then used to determine which vertices (each vertex being the SNP sequence of one sample) should be connected by an edge. A threshold value was used to filter the correlation scores, and a graph edge was created for every pair of SNP sequences whose correlation score is above the threshold. Hence, a graph representation of the rice genomic SNPs was produced. This graph was then given to the embedding model, and the resulting embedding was evaluated using a multilayer perceptron (MLP) regressor.
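The graph construction can be sketched as follows. Two points are assumptions made for illustration: the Pearson coefficient is used as the correlation measure, and the threshold is set to 0.5. Neither the measure nor the threshold value is stated above.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant sequence: correlation undefined, treated as 0
    return cov / sqrt(var_x * var_y)

def build_edges(sequences, threshold):
    """Connect every pair of samples whose correlation exceeds the threshold."""
    edges = []
    for i, j in combinations(range(len(sequences)), 2):
        score = pearson(sequences[i], sequences[j])
        if score > threshold:
            edges.append((i, j, score))
    return edges

# Hypothetical toy data: three samples, five encoded SNPs each.
samples = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 0, 2],
    [2, 1, 0, 2, 0],
]
edges = build_edges(samples, threshold=0.5)
# Only samples 0 and 1 are positively correlated above the threshold.
```

The resulting edge list, with vertices identified by sample index, is the kind of input that graph embedding implementations typically accept.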

EVALUATION METRIC
The Mean Squared Error (MSE) and the R2 (R-squared) score were chosen in this research to compare and evaluate the models. MSE is a metric that calculates the average squared error of the predictions [29]. That is, it squares the difference between each predicted and actual value before averaging [29]. R-squared (R2) is a statistical metric used to assess the performance of regression models [30]. It expresses the strength of the association between the dependent variable and the regression model on a simple 0-100% scale [30]. The R2 score reflects the dispersion of the data points around the regression line [30].
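Both metrics can be written out directly from their standard definitions. The sketch below uses hypothetical yield values for illustration only; they are not taken from the experiments.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true, y_pred)   # 0.125
r2 = r2_score(y_true, y_pred)              # 0.975
```

A lower MSE and an R2 closer to 1 both indicate predictions that lie closer to the actual values. Under this definition, R2 can fall below 0 when a model fits worse than predicting the mean.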

RESULT AND DISCUSSION
The results are divided into two parts. The first part reports the R2 and MSE evaluation metrics under different hyperparameter configurations. The second part presents a plot of the predicted and actual values for the Embedder's best hyperparameter configuration.

MODEL EVALUATION METRIC
Four hyperparameter configurations were evaluated: learning rate = 0.001 and embedding dimension = 10, learning rate = 0.0001 and embedding dimension = 10, learning rate = 0.001 and embedding dimension = 50, and learning rate = 0.0001 and embedding dimension = 50. The SNP sequence data of the rice samples were used in several embedding experiments. This study proposes a method of representing SNP data using entity embedding. The proposed entity embedding method is then compared with several other embedding methods, with the results presented in Table 2, Table 3, Table 4, and Table 5. Node2Vec and Struc2Vec performed slightly better than LINE. This is due to the random walk algorithm integrated into both models, which reduces the space and time complexity of processing graph-represented data.
The difference between Struc2Vec and Node2Vec lies in the graph processing method and perspective. The Struc2Vec model is highly dependent on the structure of the graph-represented data, since it evaluates the relation between nodes by considering the symmetry point of the graph. Therefore, this model cannot perform well on complex or poorly structured graph data. This is also why Struc2Vec cannot outperform the LINE embedding model under certain hyperparameter configurations.
Neither Node2Vec nor Struc2Vec can outperform the Embedder. The reason is that graph-based distributed representations need a large dataset to build the graph, such as the BlogCatalog, Flickr, and YouTube data used by Perozzi et al. [18] and Wang et al. [21]. Hence, the proposed Embedder method achieved the best result, as the MSE values of the comparative models are considerably higher. Node2Vec ranks second, since it achieved better MSE scores than the other two embedding models, namely Struc2Vec and LINE. Its best MSE score of 4.54 was obtained with a learning rate of 0.001 and an embedding dimension of 10. The Embedder, on the other hand, consistently produced lower MSE values, with a best score of 0.2425.

CONCLUSION
In this paper, SNP embedding has been carried out using entity embedding. The data used in this study consists of 687 rice samples taken from four different rice fields, with a total of 1,232 SNPs. The SNP data is tabular data that was pre-processed so that its original biological notation was converted into an encoded format. The experiments show that entity embedding using the Embedder outperforms the comparative methods used in this study, namely Node2Vec, Struc2Vec, and LINE. The Embedder achieved the best R2 and MSE scores of 0.9368 and 0.2425, respectively, with a learning rate of 0.0001 and an embedding dimension of 10. Future work will explore different embedding methods to attain better SNP data representation performance.