Characterization of bovine (Bos taurus) imprinted genes from genomic to amino acid attributes by data mining approaches

Genomic imprinting results in monoallelic expression of genes in mammals and flowering plants. Understanding the function of imprinted genes improves our knowledge of the regulatory processes in the genome. In this study, we have employed classification and clustering algorithms with attribute weighting to specify the unique attributes of both imprinted (monoallelic) and biallelic expressed genes. We have obtained characteristics of 22 known monoallelically expressed (imprinted) and 8 biallelic expressed genes that have been experimentally validated alongside 208 randomly selected genes in bovine (Bos taurus). Attribute weighting methods and various supervised and unsupervised algorithms in machine learning were applied. Unique characteristics were discovered and used to distinguish mono and biallelic expressed genes from each other in bovine. To obtain the accuracy of classification, 10-fold cross-validation with concerning each combination of attribute weighting (feature selection) and machine learning algorithms, was used. Our approach was able to accurately predict mono and biallelic genes using the genomics and proteomics attributes.


Introduction
Most diploid organisms, including mammalians, receive two copies of each gene from their parents and express both alleles equally in their cells. For normal development, each individual needs to receive both the maternal and paternal genomes. For many genes in mammalian species, both the maternal and paternal alleles are equally expressed. However, the expression of some genes is determined by imprinting, an epigenetic event in which only one of the alleles inherited from one of the parents get silenced and inactivated [1]. Consequently, in a limited group of genes which are imprinted, one of the parental alleles is expressed preferentially [2]. The epigenetic mechanism in the form of imprinting leads to monoallelic expression of some genes depending on parent-of-origin of the allele [3]. Therefore, if the paternal allele of the gene is imprinted, the other allele from the mother would be expressed and vice versa. This results in unequal expression of two alleles of a gene, which is in contrast to Mendelian genetics. The imprinting mechanism in mammalian species is mainly conserved. Genomic a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 imprinting leads to allele-specific gene expression [4,5]. It has been shown that many human diseases including Prader-Willi syndrome (PWS) [6], Beckwith-Wiedemann syndrome (BWS) [7] and some types of cancer [8,9] are strongly associated with defective expression in imprinted genes. Large offspring syndrome (LOS) is an example of abnormal imprinting in bovine and ovine that causes abnormally high rates of growth which is phenotypically and epigenetically similar to BWS in human [10]. The conservation pattern between different organisms has greatly facilitated the study of imprinting mechanisms in some human genetic disorders [11]. On the other hand the importance of imprinted genes is increasing, because there are some evidences that imprinting defects are associated with complex disorders like diabetes, obesity, developmental abnormalities and behavioral disorders.
Although single imprinted genes could be observed in the genome, imprinted genes typically located in clusters with 3 to 12 genes with a length of 20 kb to 3.7 Mb of DNA [11].The clustering of imprinted genes is the key mark that the imprinting process is not specific to the gene and can act through cis orientation with elements controlling the expression of multiple genes. Experiments identified key controllers as imprint control element (ICE) or imprint control region (ICR) in seven imprinted clusters. It has been showed that deleting this element in mouse leading to loss of imprinted expression following transmission through the maternal or paternal germline [12]. In fact, when ICE is un-methylated, it operates as an initiator which epigenetically repress genes in a cis position manner [13]. Although DNA methylation, histone modifications and non-coding RNAs (ncRNAs) are key mechanisms in genomic imprinting, the genomic sequence is still important [14].
Most of imprinted genes play important roles in the growth regulation of fetus and placenta in the process of development [15,16]. Many imprinted genes have been identified in human and mouse while few imprinted genes have been identified in cattle, might indicate the limitations of the bovine data set. Computational prediction of imprinted genes using attributes in the genomic DNA sequences alone has been used for mouse and human genomes [17,18]. It has been reported that imprinted genes in mammals include only one percent of the whole genome. However, a wide range of imprinted genes from 100 [18] to 600 genes [17] and even more than 2000 genes [19] is reported in the literature. The variation may be due to ignorance of tissue-specific and conditional imprinting statues for some of imprinted genes. Interestingly, there are many genes that are imprinted in human or mouse but not imprinted in cattle [3]. Although the genome of bovine with the highest percentage of annotated genes and a high sequence coverage is the most characterized genome among livestock, few imprinted genes have been experimentally identified [20]. At present, only 25 imprinted genes experimentally validated in cattle and this list could be expanded by adding new imprinted genes [21,22]. With a complete catalog of imprinted genes in different mammalian species, it is feasible to understand the role of genomic imprinting in the evolutionary process [23]. With this quantity of confirmed imprinted genes in the bovine, there is a chance to find other genomic characteristics common to bovine imprinted genes.
In this study, several sequence attributes were selected to find common characteristics shared among imprinted genes. Because of known association of differential methylation with imprinting status in this region, GC contents and CpG islands were tested. It has been suggested that high-density CpG islands, tandem repeat patterns, and retrotransposons were selective characteristics of imprinted Differentially Methylated Regions (DMRs) [24]. Gene expression, histone modifications and transcription factor (TF) binding sites are strongly correlated with methylation mechanism in DNA [25]. The epigenetic regulation through methylation of CpG islands is an important mechanism in the differentiation of embryonic stem cells into specific cells and tissues [24]. Attributes of 20 known imprinted genes in human, mouse, and cattle species were previously described and compared by Khatib et al. (2007). They observed higher values for GC contents, CpG islands and tandem repeats in imprinted genes. In several imprinted genes, DMRs were related to monoallelic gene expression pattern [3]. DNA methylation occurs largely in repetitive elements including satellite sequences, centromeric repeats, and CpG islands in or near promoter sequences. CpG islands are GC rich areas in DNA with high proportions of CpG dinucleotides.
According to the original definition, CpG islands is defined as a region with GC content greater than 50%, at least 200 bp length and CpGs observed to expected ratio above 0.6 [26]. CpG islands are commonly regarded as epigenetic key regulatory elements. Although CpG islands are generally un-methylated, a significant number of CpG islands are methylated in genes promoter regions. Methylation in the promoter region is generally related to the silencing of genes related to different types of cancers [27]. Thus, gene silencing can generally occur after hyper-methylation of CpG islands in the promoter region which leads to inactive transcription. The level of DNA methylation has also been found to be related to gene length [28]. Other studies showed that retrotransposons or tandem repeats are unconfidently predictors of imprinting statues [29,30]. Repetitive elements are nucleotide patterns that, in contrast to unique sequences, occur multiple times in the genome. In fact, a substantial part of the mammalian genome consists of repeats. In accordance with previous reports in human and mouse, imprinted genes in cattle have remarkably fewer Short Interspersed Elements (SINEs) in number compared to biallelic expressed genes. Also, imprinted genes were found significantly underrepresented Long Interspersed Elements (LINEs) and Long Terminal Repeats (LTRs) in compare to biallelic expressed genes, which is in contrast with studies on human and mouse. Cowley et al. (2011) by observing no differences in repeat element prevalence at imprinted retrogene loci, concluded that the SINE depletion and LINE abundance were not required features for imprinting. Several studies reported differences between imprinted loci and biallelic expressed genes in the occurrence of the repeated sequences biallelic. A general agreement is that SINEs to be depleted at imprinted loci with increased frequency of LINEs. By taking the high density of LINE-1 elements around X-inactivated genes into account, it can be confirmed that the monoallelic expression of autosomal genes is flanked by high densities of repetitive sequences [31].
Additionally, some studies suggested a correlation between codon usage and the level of gene expression [32,33]. Therefore, in the current study codon usage by mono and biallelic genes was considered as a feature. To our knowledge, this is the first study that amino acid and codon usage attributes were used for determining characteristics of imprinted and biallelic expressed genes.
The purpose of machine learning methods is learning functional relationships from data without defining a priori [34][35][36]. In computational biology, the purpose is to obtain predictive models without strong assumptions about underlying mechanisms, which are less well known or unknown [37]. Since bioinformatics was introduced, researchers used machine learning to accelerate studies on biomolecular structure prediction, gene finding, genomics and proteomics [38]. We have hypothesized that data mining approaches would find unique attributes of mono and biallelic genes. Classification algorithms try to build a classification model given some examples of the classes we are trying to model. The obtained model can then be used to improve our knowledge on available data. The performance of a machine learning algorithm for classification tasks can be intensively influenced by the relevance of attributes. This performance can be easily diminished if redundant attributes are used [39]. Attribute weighting models create a set of more relevant attributes by reducing their size [40]. In some functional examples, different supervised and unsupervised machine learning algorithms were used by Ebrahimi et al. (2011) to capture attributes that related to thermostability of proteins [41]. Beiki et al. (2012) also used different supervised and unsupervised machine learning algorithms to classify and predict of olive cultivars [42]. Hosseinzadeh et al. (2012) used machine learning approaches to classify different types of lung cancer [43]. Clustering is a grouping process so that the objects in each cluster have high similarity with each other but they have high dissimilarity with objects in other clusters [44]. Discovering the relationship between input attributes and a target attribute is the main characteristics of supervised methods. The discovered relationship is then used in the structure of the model. Discovering structure by exploring similarities or differences among individual data is the ability of a relevant unsupervised algorithm [45]. Clustering is considered to be the most important problem of unsupervised learning method. Meanwhile, with little or no knowledge, relevant patterns and structures can be found directly from datasets by clustering [46]. Clustering problems can be solved easily by K-Means [47], which is a simple unsupervised learning algorithm. The process uses a simple method to classify a given dataset through a certain number of clusters (assumed k clusters) fixed a priori. Representative objects instead of the mean value of objects in each cluster are used by the K-Medoids methods as reference points.
The aim of the present study was to identify unique characteristics to distinguish mono from biallelic expressed genes in the bovine genome using supervised and unsupervised machine learning algorithms and attribute weighting methods. There are conflicting reports surrounding the status of SNRPN and ZIM2 genes to be mono or biallelic expressed genes. Therefore, we did not use these genes in our analysis. Furthermore, XIST gene is also a particularly challenging example due to its pattern of expression; it is randomly expressed in rodent, human and embryonic tissues [48], while it is just expressed paternally in the rodent extraembryonic tissues [49]. As such, XIST gene was also not included in our analysis. In spite of the fact that almost majority of genes in the genome have biallelic expression patterns, low number of validated non-imprinted genes were available in the bovine genome. Therefore, we randomly selected 200 genes from bovine genome to increase the number of genes in biallelic dataset. Therefore, with 8 validated nonimprinted genes and 200 randomly selected genes, initial dataset comprised 208 nonimprinted and 22 imprinted genes (S1 Table).

Extracted attributes from genomic and protein sequences
The initial dataset contained 200 attributes in three levels containing DNA structure, transcription level as codon usage and amino acids composition. CpG islands, SINEs, LINEs, LTRs, described for the DNA sequence, were extracted from the UCSC Genome Browser website (http://genome.ucsc.edu/cgi-bin/hgGateway; February 2006 build), using table tools. UCSC Repeat Masker tracks were used for finding of repeated elements. Also, GC contents were analyzed using the R environment (http://www.r-project.org) via BS genome package (BSgenome.Btaurus.bosTau8). The Uni-Prot Knowledgebase (Swiss-Prot and Tremble) database was used for extracting protein sequences. A number of attributes such as count and frequency of each amino acid and negatively and positively charged amino acids were extracted using various bioinformatics tools and softwares from the ExPASY website (http://www. expasy.org). Codon usages of genes shown by the frequency of each codon types were obtained from www.bioinformatics.org/sms2/codon_usage.html.
The dataset was imported into the RapidMiner (RapidMiner7.5.003, www.rapidminer. com), and imprinted and biallelic expressed genes were set as the target or label attribute. In addition, differences between imprinted and biallelic expressed genes regarding attributes were statistically tested using Mann-Whitney U test. Distributions of these attributes across imprinted and biallelic expressed genes were then visualized as boxplot graphs using BoxPlotR web tools [50].

Data cleaning
To eliminate repeating characteristics, attributes with correlation coefficient higher than 0.95 were not considered. Furthermore, numerical attributes with standard deviation lower than or equal to 0.1 were removed. The final refined dataset was considered as the main source and labeled as Mds dataset.

Attribute weighting
Attribute weighting is a method of choice to identify attributes contributing to objects. This method was used to identify important attributes and their contribution to allele-specific expression. The procedure suggested by Ebrahimi et al. (2011) was used as the main guide to do attribute weighting. Eleven attribute weighting algorithms consisting weight by information gain, weight by information gain ratio, weight by principle component analysis, weight by correlation, weight by rule, weight deviation, weight by chi squared statistic, weight by Gini index, weight by uncertainty, weight by relief, weight by support vector machine were used in the main dataset.
In weight by information gain algorithm, the relevance of each attribute was evaluated by computing the information gain in class distribution. Higher weight for an attribute means it is more appropriate than others. In weight by information gain ratio, the information gain ratio for the class distribution is an indicator of the relevance of each attribute. Weight by principle component analysis (PCA) operator creates attribute weights using a component generated by the PCA. In PCA an orthogonal transformation is used to convert values of correlated attributes into observations of uncorrelated attributes. The weight of each attribute shows its importance with respect to the class attribute. In weight by correlation operator, the weight of each attribute with respect to the label attribute is calculated using correlation. The weight of each attribute show its relevance. In weight by Rule operator, the importance of each attribute of the given example set is identified by constructing a single rule for each attribute and estimating errors. In weight deviation operator, weights are created from the standard deviation of all attributes. Normalization of values is done by the average, minimum, or maximum of the attribute. In weight by chi-square statistic, the relevance of an attribute was computed based on the value of the chi-square statistic with respect to the class attribute. The chi squared statistic is used to find out whether a distribution of observed frequencies differs from expected frequencies. In chi-squared statistics, frequencies are used instead of means and variances. Weight by Gini index is another operator that identify relevant attributes by calculating the Gini index of the class distribution. In weight by uncertainty operator, the relevance of an attribute was identified by calculating the symmetrical uncertainty with respect to the class. Weight by Relief is another operator which measures the relevance of attributes by sampling examples and comparing the value of the current attribute for the nearest example of the same and of a different class. Obtained weights are normalized into the range of 0 and 1. Weight by Support Vector Machine (SVM) is an operator that attribute weights are coefficients of the normal vector of a linear SVM [51].

Selection of Attributes
A value between 0 and 1 was obtained for each attribute, after performing attribute weighting models on the Mds. This value shows the relevance of the attribute with regards to the imprinted or biallelic expressed gene as a target attribute. Variables were selected with weights more than 0.50 and consequently 11 new datasets were created (Awds) ( Table 1). These data sets were named based on their attribute weighting models (Info Gain, Info Gain Ratio, PCA, Correlation, Rule, Deviation, chi squared, Gini index uncertainty, relief and SVM) and then supervised and unsupervised models were used. Each model of the supervised or unsupervised algorithm was performed 12 times, first on the Mds and then on the new eleven datasets (Awds).

Supervised classification
Finding the correlation between input and target attributes was the main aim of supervised methods. The supervised classification was applied on the main dataset and the eleven newly created datasets from attribute weighting.
Decision trees. Four tree models i.e. decision tree, decision stump, random tree, and random forest were performed on Mds and Awds.
Neural network and Bayesian models. In quantitative modeling, the artificial neural network is an important analysis tool. Pattern classification, time series analysis, and prediction and clustering are examples of data mining tasks that can be done by neural network [52]. Four neural network models consisting, deep learning, neural net, autoMPL, and perceptron were performed on all datasets. Nowadays, deep learning is considered the highest usage fields in machine learning in computational biology [53][54][55][56][57][58][59][60]. Deep Learning was trained with stochastic gradient descent and in this process, back-propagation was used. Neural network learns a model by a feed-forward neural network trained using back-propagation algorithm (multi-layer perceptron). The feed-forward neural network or multilayer perceptron (MLP), are widely used neural network models in practice [52]. AutoMLP is an algorithm which can be used for learning rate and adjustment size of neural networks during training. The perceptron is the simplest kind of feed-forward neural network. The single layer perceptron is a linear classifier efficiently trained by a simple update rule for all classified data points.
Three Bayesian models consisting Naive Bayse, Bayse Kernel, and W-BayesNet were performed on all dataset. A Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theory with strong independence assumptions. In other words, a Naive Bayes classifier assumes that a specific attribute is not related to any other attributes. Naive Bayes is a popular machine learning algorithm performing well in many domains, and mostly used in binary sentiment classification [61][62][63]. The Naive Bayes (Kernel) operator can be used for numerical attributes which are in contrast with the Naive Bayes operator. The kernel is a non-parametric estimation technique that is used as a weighting function.

Unsupervised clustering
Unsupervised learning involves methods that group instances with any kinds of pre-specification. The unsupervised clustering was applied on the main dataset and the eleven generated data sets from attribute weighting.

K-means and k-medoids
K-means operator predicts the distance of objects to clusters using kernels. Considering the property of kernels, for calculating one distance all elements of a cluster must be generally added. K-means is known as one of the simplest algorithms of the parametric unsupervised technique [46]. K-medoids algorithm is a partitioned clustering algorithm which is slightly modified from the K-means algorithm. They both attempt to minimize the squared-error but the K-medoids algorithm is more robust to noise than K-means algorithm. In K-means algorithm, means are chosen as the centroids but in the K-medoids, data points are chosen to be the medoids. A medoid can be defined as that object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal.

Accuracy of models
To identify the performed classification and to estimate the accuracy of using each attribute,10-fold cross validation (10-fold CV) with stratified sampling was applied to train and test models on all patterns. Some random subsets are generated by stratified sampling in order to make sure that there is no difference between class distribution in the subsets and in the whole example set and almost the same rate of two values of class labels were in each subset. To do this, 10 parts were generated from all records randomly, the training phase was performed on 9 parts and testing phase was done on the 10 th part. The accuracy for true and false and total accuracy was calculated after repeating the process for ten times. The mean of the accuracy for all ten tests was reported as the final accuracy. The same cross-validation was used for all the evaluation methods and each method had the same training and testing sets used during the genetic algorithms run.

Data cleaning
The initial dataset contained 230 imprinted (monoallelic) and biallelic expressed genes with 200 attributes. 22 genes have been experimentally validated as imprinted genes and 208 genes were biallelic. After taking into account the standard deviation of each attribute and Pearson correlation coefficients between them, the amount of attributed used in the current study decreased to 164 more informative non-redundant attributes (Mds).

Attribute selection via attribute weighting
A list of 20 most important attributes selected by algorithms of attribute weighting has been shown in Table 2. This table highlights the most important attributes selected that were chosen by several weighting algorithms. The frequency of SINE 10 and 100 kb up in the sequence of genes were attributes selected by these algorithms. It indicates the importance of repetitive elements in distinguishing imprinted from biallelic gene. Based on Fig 1A and 1B the imprinted genes have lower SINE in 10 and 100 kb up regions in comparison to biallelic expressed genes (P<0.000). Fig 1D shows the web chart of repetitive elements in gene region of imprinted and biallelic expressed genes. This figure demonstrates that SINE, LINE and LTR are more important than simple repeat for distinguishing biallelic from imprinted genes. Fig 1C shows the frequency of repetitive elements in 10kb up region of imprinted and biallelic expressed genes. Average length of CpG islands in 5'-UTR was also an important attribute that selected by several attribute weighting parameters (Fig 2A). Mann-Whitney U test revealed a significant difference (P<0.000) for average length of CpG islands in 5'-UTR between imprinted and biallelic expressed genes (S2 Table). In amino acid attributes alanin (Ala), argenine (Arg) and proline (Pro) were the most important amino acid attributes selected by algorithms. Frequency (fraction) of codon usage (Pro/CCG) in imprinted and biallelic expressed genes have been shown in Fig 2B. According to the boxplot of Ala and Pro frequencies (Fig 2C and 2D) it can be concluded that imprinted genes have higher Ala and Proin their protein sequences. The frequencies of Ala and Pro amino acids were significantly different between proteins produced by imprinted and biallelic expressed genes (P<0.000). Fig 2E, shows significant differences in the frequency of Ala, Arg, Asn, Ile, Leu, Lys, Phe, Pro and Val amino acids in protein sequence between imprinted and biallelic expressed genes.   Supervised classification Decision tree. Decision tree, decision stump, random tree and random forest with four criteria consisting gain ratio, information gain, Gini index and accuracy were run on the Mds and Awds. The accuracy of supervised classification algorithms on various datasets were calculated by 10-fold cross validation and was shown in Table 3 and S3 Table. The accuracy for the majority of models was greater than 90 percent. The highest accuracy was estimated by random forest with info gain criteria on the chi-squared data set (95.22) and the lowest accuracies achieved by random tree with Gini index criteria on the deviation dataset (85. 65). Fig 5A, shows the random forest with info gain criteria on chi Squared dataset. This model had the  Fig 5B and 5C, represents models with random forest with Gini index and info gain criteria on the Info Gain data set, respectively. The accuracies of both models were 94.35, with 40.91 and 45.45 imprint recall and 100.00 and 90.91 imprint precision, respectively. Based on algorithms of attribute weighting, the number of CpG (CpGn) in 3'-UTR region was found to be one of the main attributes (Table 2). Fig 5B, shows the application of this attribute for distinguishing imprinted and biallelic genes. Values higher than 59 indicate imprinted genes and values lower than 59 refers to biallelic expressed genes.
Neural network and Bayesian models. Table 4 presents the performance of different neural network algorithms and Bayesian models using provided data sets. These models had an average accuracy of 86 percent. The highest predicted accuracy achieved by W-BayesNet on the SVM dataset (96.52%) and the lowest accuracies achieved by perceptron on the PCA data set (29.13%). W-Bayes Net on SVM were performed with Kappa value of 0.763, imprint recall of 72.73, and imprint precision of 88.89 (S4 Table). Whitney test. Frequency of using the Pro/CCG in the imprinted genes were higher than biallelic expressed genes. (C). Boxplot of Ala amino acid frequency. P-value (0.000), determined by the Mann Whitney test. As shown in the figure the frequency of Ala amino acid in protein sequence derived for imprinted genes was higher than the biallelic expressed genes. (D). Boxplot of Pro amino acid frequency. P-value (0.000), determined by the Mann Whitney test. As shown in the figure the frequency of Pro amino acid in protein sequence derived for imprinted genes was higher than the biallelic expressed genes. (E). Web chart of amino acids in protein sequences of imprinted (M) and biallelic (BI) expressed genes. P-value determined by the Mann Whitney test. The protein sequence derived attributes showed that frequency of some of the amino acids was different among imprinted and biallelic expressed genes. (F). Web chart of GC content in several regions of imprinted (M) and biallelic (BI) expressed genes. P-value determined by the Mann Whitney test. In the several region GC content of imprinted genes was significantly higher than biallelic expressed genes. https://doi.org/10.1371/journal.pone.0217813.g002

Unsupervised clustering
Unsupervised clustering algorithms consisting K-means and K-Medoids were implemented on the Mds and Awds. Each kind of clustering was not able to completely distinguish between mono (imprinted) and biallelic expressed genes. The accuracy of unsupervised clustering algorithms on various datasets obtained by 10-fold cross validation and was shown in Table 5.
Test accuracy of classifiers. To test the accuracy of the classifiers on a new set of genes, the imprinted bovine genes identified by Chen Z et al (2016) were used. 50 classifiers with accuracy more than 92 percent were tested on 35 genes that identified as imprinted by Chen Z et al (2016) [64]. Number of classifiers successfully detected gene as imprinting shown in Table 6 and S5 Table. We found that except for 5 genes, our models were able to classify the remaining genes as imprinted.

Discussion
SVM dataset with the average accuracy of 91.71% in induction models and 94.20% in Neural Network and Bayesian models had the highest accuracy among evaluated datasets. Therefore, this pattern could be better than others in distinguishing imprinted and biallelic expressed The performance of gene expression is influenced by the composition of the nucleotide in the coding region like GC content and codon usage. Although, these two important attributes are strongly connected to gene expression, the molecular functions behind these attributes are not completely obvious [65]. The GC content of imprinted genes in several regions was higher than biallelic expressed genes (Fig 2F). These observations are in agreement with the results presented by Khatib et al. (2007). However, in contrast to these results, it was reported by Walter, et al. (2006) that the GC content of genes with imprinted expression is like a subset of genes in the mouse genome which are randomly selected. In addition, Hutter et al. (2006) compared human and mouse genomes and reported no statistically significant differences between control and imprinted genes in these species. Such contradiction might indicate a different evolutionary pathway for cattle genome compared to those of mouse and human.
Furthermore, previous studies suggested a positive correlation between codon usage bias and the level of gene expression [32,65,66]. It could be an important element for the stability of mRNA and identification of translational efficiency [65]. Also in some species, it has been shown that there is a correlation between preference of codon usage and the abundance of the respective tRNA [67][68][69][70][71] which at last can affect translational efficiency of the gene products [70]. Fig 1D shows the boxplot of codon usage (Pro/CCG) in imprinted (M) and biallelic (BI) expressed genes.
Differences in the occurrence of repeated sequences between biallelic expressed genes and imprinted genes have been reported in several studies. Usually, high CpG islands, transcription factor binding sites (TFBS) and repetitive elements are the sequence characteristics of imprinted genes [72,73]. For analyzing known and hypothetical imprinted genes, these attributes are normally used [3,19]. Fewer and smaller introns can be seen in imprinted genes compared to non-imprinted genes. They have also some degrees of repetitive sequences and contain an unusually high number of retrotransposable elements [74]. In present study intron counts in imprinted and biallelic expressed genes were not significantly different (P<0.184). However, imprinted genes showed significantly lower frequencies of SINE in 10 and 100 kb up compared to biallelic expressed genes in bovine. Depletion of SINEs is adapted with reasonably increased frequency of LINEs in imprinted genes. Interestingly, Cowley et al. (2011) found no significant differences in LINE or SINE in the mono and biallelic expressed genes neither in mouse nor human. In the study of Greally (2002) several sequence attributes of human imprinted genes were compared to non-imprinted genes. He demonstrated that fewer SINE transposons-derived sequences can be seen in imprinted loci than biallelic loci and the number of SINEs is directly correlated with the imprinting status. Additionally, the frequency of SINE in the flanking regions of biallelic and imprinted genes is low (Allen et al., 2003). Cowley et al. (2011) suggested that LINEs and SINEs are genomic attributes which are not directly correlated with imprinting status of genes. Khatib et al. (2007) found higher GC content, CpG islands, and tandem repeats in imprinted genes than nonimprinted genes in bovine. Additionally, the frequency of SINEs in imprinted genes of bovine was lower than biallelic expressed genes. These findings are in agreement with findings in human and mouse. In accordance with Khatib et al. (2007), the number of LINEs and LTRs were found to be significantly lower in imprinted genes compared to biallelic genes in present study (Fig 3B) which are in contrast to the findings in human and mouse. Allen et al, (2003) used cluster analysis to examine heterogeneity in monoallelic genes in human and mouse with respect to sequence attributes. They found biallelic expressed genes display lower number of LINE-1 sequence, while the imprinted genes were flanked by lower amounts of SINE sequence [31]. Weidman et al. (2004) reported that the imprinting of IGF2 is strongly associated with the lack of SINEs. Similarly, Walter et al. (2006) showed that the frequency of SINEs in imprinted genes (7.32%) is lower than non-imprinted genes (13.9%). They also reported higher GC content and higher number of LINEs sequences in the imprinted genes compared to non-imprinted genes.
The pattern of distribution of repetitive elements, in combination with other sequence attributes, has been used for prediction of putative imprinted genes in the mouse [17] and human genomes [18]. Some studies showed that in human genome, SINE elements are enriched in Characterization of bovine imprinted genes GC-and gene-rich regions, whereas LINE elements harbor lower GC and are mostly found in gene-poor regions [75,76]. Tandem repeats have been implicated in the regulation of mouse imprinted genes including Rasgrf1, Xist, and Tsix [77][78][79]. Although the mechanism of imprinted methylation through tandem repeats is unknown, hypotheses such as siRNA mediated regulation, secondary structure formation, and involvement of germline-specific repeat binding factors could be take into account. Tandem repeats might continuously produce siRNA through the use of RNA-dependent RNA polymerase (RdRP) and cause multiple rounds of RNA interference (RNAi) [80]. Such events have been detected in the regulation of epigenetic phenomena in fission yeast and plants and may operate in mice as well [81][82][83][84]. Therefore, these attributes may affect the establishment or maintenance of DNA methylation at imprinted loci during development, especially during germline development.

Conclusions
According to our results, attributes related to GC content and CpG in upstream and downstream regions of genes, SINE in 10 and 100kbUp and frequency of some amino acids including Ala, Arg, Pro were the most important attributes for distinguish imprinted and biallelic expressed genes. The sequence characteristics presented in the current study predict the imprinting status of genes in bovine with high accuracy. This method could be applied to expand the number of imprinted genes in genome of other species. With more imprinted genes in hand, it would be possible to deepen our understandings regarding the genetic and epigenetic regulatory mechanism involved in the monoallelic expression of imprinted genes. Besides, assessment of the method in other genomes might be useful to find an evolutionary relationship among species and would be beneficial to find monoallelically expressed genes elsewhere. Also, the next step would be the application of these patterns in the identification of novel sets of imprinted genes.
Supporting information S1