Recent Research Progress of Long Non-coding RNA

Long non-coding RNA (lncRNA) generally refers to transcripts longer than 200 nucleotides that do not encode proteins but nevertheless have important functions. In recent years, lncRNA has become a major focus of biological research. This review summarizes advances in understanding the roles of lncRNA in human disease and plant biology, among other areas, and discusses the implications of these findings. As attention to lncRNA has grown, an increasing number of software tools have been developed to predict it. We introduce several of these tools, including their classification algorithms and their respective strengths and weaknesses in performance and accuracy. As sequencing technologies have improved, the approaches for predicting lncRNA have changed accordingly, and we describe in detail the lncRNA prediction and quantification workflows for different sequencing technologies. Finally, we summarize some of the challenges and future prospects of lncRNA research, in the hope of informing future work.


Background
Regulation of gene expression plays an important role in the growth and development of organisms. It can be divided into pre-transcriptional, transcriptional, post-transcriptional, translational and post-translational regulation (Wang et al., 2015). LncRNA plays a vital role in the regulation of gene expression in eukaryotes (Hu et al., 2016). Many lncRNAs are thought to be associated not only with the regulation of mRNA expression but also with growth, development and disease (Guo and Liu, 2014).
As more researchers have taken up the study of lncRNA, the main mechanisms by which it performs its biological functions have become increasingly clear. These mechanisms include cell cycle regulation, mRNA degradation, chromatin remodeling, gene imprinting, stabilization of mRNA, acting as a scaffold for histone-modifying complexes, and regulation of the phosphorylation of serine/arginine-rich splicing factors (Xia et al., 2013).
As the study of lncRNA has deepened, it has been found that lncRNA not only performs its biological functions through various mechanisms but is also related to the occurrence of certain diseases. For example, recent studies have shown that lncRNA molecules can markedly accelerate metastasis and regulate the epithelial-mesenchymal transition in colon cancer (Chen et al., 2017). Although advances in sequencing technology have led to the identification of a large number of lncRNAs, our understanding of their functions remains quite limited, and new research methods are still needed to explore how they act in vivo.
LncRNA research is at the forefront of biology and is currently one of its most active areas. The study of lncRNA helps us to understand the structure and functions of genes more deeply and from a new perspective. As research techniques continue to improve, biological regulatory processes involving lncRNA will become clearer, allowing us to better understand the functions of different transcripts. Some lncRNAs are abnormally expressed in tumor tissues and can be used as biomarkers for predicting tumors or cancer (Yarmishyn and Kurochkin, 2015). To date, a large number of lncRNAs have been identified as cancer biomarkers (Iyer et al., 2015). For example, in a study of lncRNA in prostate cancer cells, researchers discovered an lncRNA called SChLAP1 (Zhao, 2014). The expression of this lncRNA in prostate cancer cells was significantly higher than in early-stage prostate cancer cells. This characteristic is helpful for cancer diagnosis and postoperative monitoring, and gives researchers an additional means of studying the pathogenesis of cancer. In plants, researchers have found that the rice-specific lncRNA LDMAR is linked to programmed cell death in the anther tapetal layer of rice, resulting in photoperiod-sensitive male sterility (Ding et al., 2012). This result may be of great significance for research on the light regulation pathway in rice.

Action Mode of LncRNA
LncRNA does not encode proteins, but it performs many important functions in vivo by regulating genes. LncRNA can act in the following ways (Figure 1): (1) transcription of lncRNA from the upstream region of a gene can affect that gene's expression; (2) lncRNA can affect gene expression by inhibiting certain polymerases; (3) lncRNA can interfere with mRNA splicing; (4) lncRNA can cooperate with the Dicer enzyme to regulate gene expression; (5) lncRNA can directly regulate the activity of related proteins; (6) lncRNA can form nucleic acid-protein complexes; (7) lncRNA can alter the cytoplasmic localization of proteins; and (8) lncRNA can interact with related small molecules (Jeremy et al., 2008). The regulation of gene expression by lncRNA can be classified into three categories (Figure 2): (1) epigenetic regulation; (2) regulation during transcription; and (3) post-transcriptional regulation. Studies have shown that lncRNA can act as a transcriptional activator, activating the expression of nearby genes and regulating the activity of transcription factors and RNA polymerase (Ørom et al., 2010).

Research Progress of LncRNA in Human Diseases
We now know that lncRNA can bind to and interact with macromolecules in cells, such as DNA, RNA and proteins, and thereby control important cancer phenotypes. Numerous studies have shown that the lncRNA HOTAIR is closely related to human cancer (Wu et al., 2014). HOTAIR was first linked to cancer in human breast cancer cells. Elevated HOTAIR expression in a tumor indicates a higher likelihood of progression to advanced disease with a poor outcome, so it can serve as an important tumor marker for diagnosis (Gupta et al., 2010). LncRNA also regulates its associated protein-coding genes, and misregulation of such lncRNAs may lead to disease. Studies have shown that an antisense lncRNA transcribed from the locus of p15, a tumor suppressor gene, can induce DNA methylation, contributing to leukemia (Yu et al., 2008). Other examples include the liver cancer-related lncRNA ZFAS1: increased expression of this lncRNA in mice promotes the metastasis of liver cancer cells (Li et al., 2015).
LncRNA is also closely related to Alzheimer's disease, also known as senile dementia, and recent studies have found that lncRNA is an important factor in its development. A major cause of Alzheimer's disease is the amyloid produced by a secretase, which induces the disease when it accumulates in the body (Burns, 2009). The antisense strand of the secretase-encoding gene BACE1 can be transcribed to generate the lncRNA BACE1AS (Tan et al., 2013). This lncRNA protects the BACE1 mRNA from degradation, leading to a continuous increase in amyloid protein, while the accumulation of amyloid in turn promotes expression of the secretase-encoding gene. This positive feedback loop aggravates Alzheimer's disease. Based on this mechanism, silencing or inactivating this lncRNA might be a way to treat or alleviate the disease.

Research Progress of LncRNA in Plant
Compared with animals, research on lncRNA in plants is still far from sufficient, and relatively few plant species have been studied. In mammals, human and mouse lncRNAs have been studied in relative depth (Sun et al., 2012; Shi et al., 2013). Plant lncRNAs show surprisingly strong tissue-specific expression. For example, rice lncRNAs have been identified at the genome-wide level by high-throughput sequencing, and analysis of the identified lncRNAs revealed highly tissue-specific expression.
Studies on rice lncRNA found that some lncRNAs are highly expressed during sexual reproduction, suggesting that their function is related to sexual reproduction in rice. Further studies showed that one lncRNA, XLOC_057324, can affect panicle development. The researchers found an insertion mutant of XLOC_057324 in a rice mutant database. The insertion resulted in a significant decrease in the abundance of the isoforms of this lncRNA, leading to phenotypic changes. When the lncRNA mutant and wild-type ZH11 rice were planted at the same time, the mutant flowered earlier than the wild type, but its panicles were noticeably less full. These findings indicate that XLOC_057324 is involved in regulating panicle growth and development and is closely related to sexual reproduction in rice.
Recent research has indicated that lncRNA also plays an important regulatory role in the growth and development of soybean. In a study of soybean lncRNA, co-expression analysis of protein-coding genes and lncRNAs suggested that lncRNAs may be involved in stress response, signal transduction and developmental processes. In addition, the researchers observed lncRNA expression in centromere regions, especially in active meristems, suggesting that lncRNA might be involved in the regulation of cell division.

LncRNA Databases
As lncRNA research has spread, a number of organizations and academic research institutes have established lncRNA databases to facilitate research and academic exchange (Amaral et al., 2011). The data in these databases come from a wide range of sources, including the published literature and direct experiments, and are classified by species or by research object (Table 1).

NONCODE (http://www.noncode.org; Liu, 2004). One point needs special attention when using this database. Because it uses its own lncRNA naming system, a search using the general naming system will not find the desired lncRNA. The database provides a BLAST function for this purpose: if the user has the nucleic acid sequence of an lncRNA, the corresponding identifier in the database can be found from the BLAST results and then used to retrieve the annotation information of interest.

lncRNADisease (http://cmbi.bjmu.edu.cn/lncrnadisease). This database collects more than 160 disease-related lncRNAs and provides annotations of disease-related lncRNAs reported in the literature (Chen et al., 2012). It also allows lncRNA annotation information to be browsed on the website, which makes lncRNA annotation much easier.

CHIPBase (http://deepbase.sysu.edu.cn/chipbase/). This database is rich in content and holds a large number of lncRNA annotations. It also provides transcription factor binding sites for lncRNA loci, together with lncRNA expression maps (Yang et al., 2013).

lncRNAdb (http://www.lncrnadb.org/). This database collects information on eukaryotic lncRNAs reported in the literature. Each entry contains reference information about the RNA, its expression, subcellular localization, conservation, functional evidence and other relevant information. The lncRNAdb database can be used to cross-check the literature on lncRNA and other genomic elements (Amaral et al., 2011).

LncRNA Classification
There are many ways to classify lncRNA. According to their position in the genome relative to protein-coding genes and their exons, lncRNAs can be divided into the following categories (Figure 3). Sense lncRNA overlaps one or more exons of a protein-coding gene on the same strand. Antisense lncRNA overlaps one or more exons of a gene on the opposite strand. Intergenic lncRNA is located between two protein-coding genes. Intronic lncRNA is located within an intron of a protein-coding gene, that is, between two of its exons. At present, the functions of the different types of lncRNA have not been studied in depth. Finding common features related to the functions of each type would greatly help in understanding the mechanisms of lncRNA in the body.

Software Related to LncRNA Prediction
The key problem in predicting lncRNA is how to distinguish mRNA from ncRNA. At present, the main approach is to build a classifier based on characteristics of lncRNA such as histone modification sites, base composition and sequence conservation.
A representative tool based on base sequence characteristics is PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme). This software classifies transcripts with a support vector machine (SVM): it calculates the k-mer frequencies of each transcript and assigns the transcript to the protein-coding or non-coding class. PLEK's classification does not depend on sequence alignment or genomic information. Another advantage is its speed. On the same data set, PLEK is 8 times faster than CNCI and 244 times faster than CPC, one of the most popular prediction tools. The accuracy of PLEK may vary between species. For example, when tested on known mouse mRNAs, PLEK misclassified many of them as lncRNA, whereas on maize data its accuracy was satisfactory. In general, it is a stable and reliable prediction tool.
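The k-mer frequency features mentioned above can be sketched as follows. This is a minimal illustration, not PLEK's actual implementation: the function name `kmer_frequencies` and the default `k_max=3` are assumptions made for the example, and PLEK's own k-mer weighting and SVM training are omitted.

```python
from itertools import product


def kmer_frequencies(seq, k_max=3):
    """Map every k-mer (k = 1..k_max) over ACGT to its relative frequency in seq.

    Illustrative only: the weighting scheme and classifier used by PLEK
    are not reproduced here.
    """
    seq = seq.upper().replace("U", "T")
    features = {}
    for k in range(1, k_max + 1):
        # Start from zero counts so that absent k-mers still get a feature.
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows containing N or other symbols
                counts[kmer] += 1
        total = sum(counts.values())
        for kmer, count in counts.items():
            features[kmer] = count / total if total else 0.0
    return features


freqs = kmer_frequencies("ACGT", k_max=2)
```

The resulting dictionary has a fixed set of keys for a given `k_max`, so it can be turned into a fixed-length feature vector and passed to any classifier, including an SVM.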
Another classic lncRNA prediction tool is CNCI (Coding-Non-Coding Index), which was developed by the team of Zhao Yi at the Institute of Computing Technology, Chinese Academy of Sciences, and classifies sequences based on their intrinsic characteristics. This software calculates the usage frequency of adjacent codons (nucleotide triplets) in protein-coding and non-coding transcripts, which is then used to construct a score matrix. The most CDS-like region of each transcript is selected, and an SVM classifier is constructed by combining it with single-nucleotide frequency. Because the software works on incomplete sequences, it is well suited to partial EST sequences and de novo assembled transcripts.
Many other prediction tools exist in addition to the two described above. Their core ideas are the same, but their specific implementations differ. Depending on the sequences being analyzed, each tool has its own advantages and disadvantages, and the intersection of the predictions of several tools can be taken as a reliable result.
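Taking the intersection of several tools' predictions amounts to a set operation. The sketch below assumes each tool's output has already been parsed into a set of transcript IDs called non-coding; the function name `consensus_noncoding` and the transcript IDs are hypothetical.

```python
def consensus_noncoding(predictions):
    """Return the transcript IDs that every tool called non-coding.

    predictions: dict mapping a tool name to the set of transcript IDs
    that the tool classified as non-coding.
    """
    call_sets = [set(ids) for ids in predictions.values()]
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Hypothetical calls from three tools on the same transcript set.
calls = {
    "PLEK": {"tx1", "tx2", "tx3"},
    "CNCI": {"tx2", "tx3", "tx4"},
    "CPC": {"tx2", "tx3", "tx5"},
}
consensus = consensus_noncoding(calls)
```

Requiring agreement from every tool lowers the false-positive rate at the cost of sensitivity, since a transcript missed by any single tool is dropped.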

Bioinformatic Identification Procedure of LncRNA
The analysis and identification of lncRNA can currently be divided into two categories according to the data source (Figure 4). One category covers data generated by Illumina sequencing technology. The other covers data obtained by single-molecule sequencing with third-generation sequencing technology.

Figure 4 Flow diagram of lncRNA identification from next-generation sequencing data and third-generation sequencing data

The Illumina sequencing platform is widely used. The data it generates first need to be converted into raw sequence data (Raw Data). The next step is to filter and clean the data, because raw data from the sequencer generally contain some adapter contamination and may also contain low-quality reads. Filtering generally removes reads containing sequencing adapters and selectively removes bases that were not called accurately (Iyer et al., 2015). The clean data are then assembled into transcripts by software. There are two ways of assembling transcripts. One is reference-based assembly, which uses TopHat to align the sequencing data to the reference genome (Kim, 2014) and then Cufflinks to assemble the aligned reads. This method has high sensitivity and requires less memory. The other is de novo assembly, which obtains transcripts directly from the overlaps between sequencing reads without a reference genome. This method does not rely on alignment software or an existing reference genome, but it requires large memory resources and higher sequencing depth.
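The read-cleaning step described above can be sketched as follows. This is a simplified illustration, not any particular tool's behavior: the function names, the adapter string and the mean-quality threshold of 20 are assumptions, reads are given as (name, sequence, quality) tuples with Phred+33 quality strings, and whole reads are dropped rather than trimmed.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def clean_reads(reads, adapter, min_mean_quality=20):
    """Keep reads that contain no adapter sequence and have adequate quality.

    reads: iterable of (name, sequence, quality) tuples.
    adapter and min_mean_quality are illustrative parameters.
    """
    kept = []
    for name, seq, qual in reads:
        if adapter in seq:
            continue  # adapter contamination
        if not qual or mean_phred(qual) < min_mean_quality:
            continue  # low-quality read
        kept.append((name, seq, qual))
    return kept


ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter sequence for the example
raw = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),           # clean, high quality
    ("r2", "ACGT" + ADAPTER, "I" * 17),           # contains the adapter
    ("r3", "ACGTACGTAC", "##########"),           # low quality
]
cleaned = clean_reads(raw, ADAPTER)
```

In practice, dedicated trimming tools clip adapters and low-quality bases from read ends instead of discarding the whole read, which preserves more data.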
After transcript assembly, the next step is to screen for lncRNA. There are two basic screening criteria: transcript length and exon number. For instance, transcripts of at least 200 bp are usually selected from the assembled transcripts, in line with the typical length distribution of lncRNA (Wilusz et al., 2009). The resulting transcripts are then screened for their potential to encode proteins. The commonly used methods are CPC, CNCI, Pfam protein domain analysis and CPAT. In general, the intersection of the results of these tools is taken to reduce the false positives that a single tool might produce.
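The basic length and exon-number screen can be sketched as below. The 200 bp threshold comes from the text; the exon threshold `min_exons=2` is an assumption for illustration, since no value is given here, and the `Transcript` class and transcript IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    length: int       # transcript length in bp
    exon_count: int   # number of exons in the assembled transcript


def basic_lncrna_screen(transcripts, min_length=200, min_exons=2):
    """Keep transcripts meeting the length and exon-number criteria.

    min_exons is an assumed default; adjust it to the study design.
    """
    return [
        t for t in transcripts
        if t.length >= min_length and t.exon_count >= min_exons
    ]


assembled = [
    Transcript("tx1", 1500, 3),
    Transcript("tx2", 150, 2),   # too short
    Transcript("tx3", 800, 1),   # single exon
    Transcript("tx4", 200, 2),   # exactly at both thresholds
]
candidates = basic_lncrna_screen(assembled)
```

The surviving candidates would then go on to the coding-potential tools listed above.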
Once the lncRNAs are obtained, downstream analyses can be carried out, such as lncRNA family classification, expression analysis, differential expression analysis, lncRNA-mRNA co-expression analysis and pathway enrichment analysis. Finally, the functional analysis is combined with the properties of the samples to address the biological questions of interest.
As sequencing technology continues to advance, third-generation high-throughput sequencing has been accepted by a growing number of researchers and promptly applied in research practice, providing a powerful tool for the life sciences. In particular, the Single Molecule Real-Time (SMRT) sequencing technology of Pacific Biosciences is widely used. One reason for the popularity of third-generation sequencing is that its ultra-long reads are extremely convenient for research. For instance, the average read length of PacBio SMRT sequencing can reach about 10 kb, which makes it possible to obtain complete transcript sequences without assembly. Because second-generation reads are short, full-length transcript sequences cannot be obtained directly, and a transcript assembly step must be added. Assembly inevitably introduces errors, which limits transcript research. Thanks to the ultra-long reads of third-generation sequencing, more flexible methods for studying transcripts are now available. The main steps of lncRNA identification with third-generation high-throughput sequencing are summarized below.
First, RNA is extracted from the sample and reverse-transcribed into cDNA. Libraries of different insert sizes are then constructed as required and sequenced. The raw sequencing data still need to be filtered. High-quality insert reads are classified as full-length or non-full-length transcripts. Full-length transcripts are clustered, and non-full-length transcripts are used to correct the clustered full-length transcripts (Gordon et al., 2016), yielding high-quality transcripts.
The subsequent steps are similar to the analysis of next-generation sequencing data. The high-quality transcripts are classified by PLEK into protein-coding and non-coding transcript sequences. Non-coding sequences of at least 200 bp are then selected. Next, EMBOSS is used to filter out transcripts containing ORFs that encode more than 100 amino acids. Finally, BLAST is used to align the remaining transcripts against the NR protein database to further remove protein-coding sequences, yielding highly reliable lncRNA transcript sequences.
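The ORF-length filter in this workflow can be illustrated with a small stand-in for the EMBOSS step. This is a sketch under stated assumptions, not EMBOSS itself: an ORF is taken to run from an ATG to the next in-frame stop codon, both strands are scanned, unterminated ORFs are ignored, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf_aa(seq):
    """Length in amino acids of the longest ATG-to-stop ORF on either strand.

    ORFs that reach the end of the sequence without a stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    best = 0
    for strand in (seq, reverse_complement):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    # Codons from ATG up to, but excluding, the stop codon.
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def passes_orf_filter(seq, max_orf_aa=100):
    """True if no complete ORF encodes more than max_orf_aa amino acids."""
    return longest_orf_aa(seq) <= max_orf_aa


short_orf = "ATGAAATAA"                    # encodes 2 amino acids
long_orf = "ATG" + "GCC" * 120 + "TAA"     # encodes 121 amino acids
```

Here `passes_orf_filter(short_orf)` keeps the transcript as a lncRNA candidate, while `passes_orf_filter(long_orf)` rejects it because its ORF exceeds 100 amino acids.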

Prospects
As more researchers turn their attention to lncRNA and sequencing technology continues to develop, a growing number of lncRNAs have been identified, some of which play wide-ranging roles in the organism. LncRNA is also very important in the initiation and development of cancer, where it has a variety of biological functions, such as epigenetic regulation and the inhibition or activation of gene expression. It is now well recognized that lncRNA plays an important role in organisms. However, compared with other fields, research on lncRNA is still at an early stage, and many problems urgently need to be solved.
There is still no fully accurate method for identifying lncRNA. Some transcripts have open reading frames but do not encode proteins, while some transcripts defined as lncRNA can be translated into small peptides. In both cases, using protein-coding ability as the sole criterion for deciding whether a transcript is an lncRNA will lead to errors.
Given the complexity and diversity of lncRNA function and structure, the identification of lncRNA should not rely only on the few existing methods. More effective identification methods and more extensive, systematic studies are needed.