Carbon-based archiving: current progress and future prospects of DNA-based data storage

Abstract The information explosion has led to a rapid increase in the amount of data requiring physical storage. However, in the near future, existing storage methods (i.e., magnetic and optical media) will be insufficient to store these exponentially growing data. Therefore, data scientists are continually looking for better, more stable, and space-efficient alternatives to store these huge datasets. Because of its unique biological properties, highly condensed DNA has great potential to become a storage material for the future. Indeed, DNA-based data storage has recently emerged as a promising approach for long-term digital information storage. This review summarizes state-of-the-art methods, including digital-to-DNA coding schemes and the media types used in DNA-based data storage, and provides an overview of recent progress achieved in this field and its exciting future.

Thank you for sparing your valuable time and providing useful suggestions. We have read your comments and made changes to the manuscript accordingly. We have carefully revised the manuscript, and we have also reconstructed some of the sections and added some new information as per your suggestions. Our specific responses are as follows:
1) [Page 3]: A 4th unique feature of DNA that might be included is the ease and rapidity with which DNA can be replicated using, for example, PCR. Response: We thank you for your valuable suggestion. As the PCR technique is now well-developed and cost-effective, the replication of DNA sequences encoding digital files is easy and efficient. Meanwhile, for in vivo DNA storage, living cells can also replicate rapidly as long as they are active and have a sufficient food supply. Therefore, the convenience of file replication and backup is another unique feature of DNA-based data storage. This feature is described in [Page 3, Lines 18-21].
2) [Page 4, Line 8]: The authors write "there is a trade-off between accuracy and redundancy". In my interpretation, this is counter-intuitive, as additional redundancy should reduce errors. Response: We thank you for your useful suggestion. Additional redundancy, including error-correction codes, is designed to ensure the fidelity of DNA-based data storage. However, redundancy consumes resources (i.e., bases in the DNA sequence) and thus reduces the coding density. That is why we mentioned "the trade-off between accuracy and redundancy". To clarify this point, we have elaborated the details in [Page 4, Lines 10-14].
3) [Page 4, Lines 13-15]: Concerning random access, many experimental works demonstrating DNA data storage do not have random access. Thus it may not necessarily be a requirement. Can the authors discuss this further? Response: We thank you for your valuable suggestion. It is true that many experimental works did not consider random access in DNA-based data storage. However, in conventional large-scale storage systems (e.g., computer systems), random access is one of the most basic features for data retrieval. Accordingly, the research team from the University of Washington and Microsoft reported the significance of their work on random access for DNA-based data storage. We emphasize the importance of random access in [Page 4, Lines 17-19].
6) [Pages 11-12]: The authors might also want to mention other methods of storing data in vivo, for instance with recombinases, and other molecular recorders like Cas9. Response: We thank you for the useful suggestion. In some recent works, molecular tools such as CRISPR-Cas have been described for writing information in vivo. The corresponding work is mentioned in [Page 12, Lines 15-18]. Similarly, the possible application of CRISPR and recombinases in DNA-based data storage is mentioned in [Page 14, Lines 8-10].
7) [Page 13, Lines 23-25]: Is length really the major challenge? Why not just write throughput in general, which can be increased by synthesis of longer strands (as stated), and/or by writing more strands in parallel (which is not mentioned), for instance by making larger, more dense oligo synthesis arrays. Response: We thank you for your valuable suggestion. We consider oligo length to be one of the major challenges because, in DNA-based data storage, retrieving the data requires indices (e.g., 1, 2, 3…) that record the address of each oligo in a pooled oligo mixture. As the file size increases, more oligos are needed and therefore larger indices. The index region in a data-encoding DNA sequence therefore becomes longer and reduces the coding efficiency. With longer oligos, fewer oligos are required to store a file of the same size, and the index region becomes shorter.

Dear Reviewer #2,
Thank you for sparing your valuable time and providing useful suggestions. We have read your comments and made changes to the manuscript accordingly. We have carefully revised the manuscript, and we have also reconstructed some of the sections and added some new information as per your suggestions. The objective of this manuscript is to help readers understand that the coding scheme and the storage medium are the two major research foci in the DNA-based data storage field. Moreover, we would also like to introduce the challenges in current DNA-based data storage, which may inspire researchers in related fields with ideas for further studies. Therefore, for coding schemes, we tried to introduce some key yet well-accepted bit-to-base algorithms and stated their improvements with respect to coding density, capability of error correction, and capability of random access. Since there are currently no systematic studies on storage media, we introduced some representative works that employed in vivo and in vitro strategies, and provided a comparative description. Our specific responses are as follows:
1) Page 3: Advantage of using DNA for storage would also include: easy amplification in vivo by live cells at very low cost, and possible amplification in vitro by enzymatic reaction, e.g., PCR or linear amplification in silico. Both approaches can be used to scale up the backup copy production. One should also consider the possible employment of repair system for correcting errors. Response: We thank you for your precious suggestion. As the PCR technique is now well-developed and cost-effective, the replication of DNA sequences encoding digital files is easy and efficient. Meanwhile, for in vivo DNA storage, living cells can also replicate rapidly as long as they are active and have a sufficient food supply. Therefore, the convenience of file replication and backup is another unique feature of DNA-based data storage. This feature is described in [Page 3, Lines 18-21].
2) Pages 5-10: The description of the coding schemes is quite sketchy. The outline for each approach was brief and was not well illustrated by the panels in Figures 1 and 2. Better schematics may help, without the need to go back to the original papers to make detailed comparison. Response: We thank you for your valuable suggestion. In this section, we presented the differences between coding schemes in terms of their bit-to-base transcoding methods, which optimize coding efficiency, and their strategies for adding redundancy to ensure fidelity. Although we presented the various coding schemes chronologically, we tried to convey that all of them made improvements along these two lines. To emphasize this further, we have stated this point just before the description of the coding schemes. Furthermore, more details are given for each coding scheme, along with our current understanding and perspectives. We reconstructed Figure 1 and corrected the mislabeled figure caption in [Page 5] for better presentation.
3) Pages 10-12: In vivo and in vitro storage of the information: a thorough comparison of the pros and cons would be helpful, instead of factually describing what methodology is available. The error generated in vivo by mutation should be contrasted to the error in DNA synthesis technology to evaluate the limitation of these tools. Response: We thank you for this nice suggestion. In vivo and in vitro storage are two strategies that can be distinguished in many respects. We compared their pros and cons, and added one paragraph that discusses the comparison using a table. This paragraph is in [Page 13, Lines 11-20].
4) Page 13, Lines 4-10: This is very typical of how this review describes methodology: citing previous reports without describing the details or contrasting the differences sufficiently. It does not serve the purpose of a proper analysis of how each method advances the development of storage. Response: We thank you for your useful suggestion. Since the concept of DNA digital storage was put forward in 2012, many strategies have been reported. In this review, we mainly covered the major events of this field with respect to coding schemes and storage media, rather than molecular tools, synthesis techniques, etc. Some other strategies, including those of Song et al. and Lee et al., did not make much improvement in coding efficiency or application demonstration. In our view, however, they also provided inspiration for this field of research that should not be neglected. Therefore, we mentioned these studies briefly for readers who are interested in this field. In addition, we provided brief opinions instead of merely describing the methodology in this part. The amended paragraph is in [Page 14, Lines 2-10].
5) Page 14: The sequencing accuracy issue was discussed concerning the data retrieval process. While Table 1 summarizes the factual information of the technology available, no clear evaluation of the future direction is given, same as pointed out in (4) above. Response: We thank you for your precious suggestion. High-throughput sequencing is one of the most significant tools for DNA-based data storage. We summarized the current techniques by their cost and throughput. We also added some evaluation of the future development direction of sequencing techniques. In addition, we reconstructed the section "Challenges of DNA-based data storage". The amended section is in [Page 14, Line 11-Page 17, Line 13].
6) Fig. 4: The appended figure has no Y-axis label. Response: We thank you for your valuable suggestion. The corrected figure has now been re-uploaded accordingly.
Thank you again for the peer review.

† These authors contributed equally to this work.

Introduction to DNA-based data storage
The concept of DNA-based data storage was introduced by computer scientists and engineers in the 1960s [1]. In one pioneering attempt, made in 1988 by Joe Davis in his seminal artwork "Microvenus" [2], an icon was converted into a string of binary digits, encoded into a 28 base-pair (bp) synthetic DNA molecule, and later successfully sequenced to retrieve the icon [2]. Although Microvenus was originally designed for interstellar communications, it demonstrated that non-biological information could also be stored in DNA. Later, in the early

With its double-helix structure and base-stacking interactions, DNA can persist a thousand times longer than a silicon device [4], and survive for millennia, even in harsh conditions [5-8]. Secondly, DNA possesses a high storage density. Theoretically, each gram of single-stranded DNA can store up to 455 exabytes of data [9]. As storage strategies continue to improve, scientists have now achieved a density that could reach this theoretical limit.
Thirdly, DNA can be easily and rapidly replicated through the polymerase chain reaction (PCR), thereby providing the possibility for large-scale data backup. It should not be neglected that living cells are also perfect tools for in vivo information replication and backup. Last but not least, the biological properties of DNA enable current sequencing and chemical synthesis technologies to read and write the information stored in DNA, thereby making it an excellent material to store and retrieve data [9].

The recently announced Lunar Library™ project aims to create a DNA archive of a collection of 10,000 images and 20 books for long-term backup storage on the Moon. This highlights the advantage and immense potential of DNA as a medium for long-term digital data storage.

The accessibility of DNA-based data storage is mainly driven by two empowering techniques: DNA synthesis for 'encoding', and DNA sequencing for 'decoding' [10]. Typically, digital information is first transcoded into ATCG sequences using a predeveloped coding scheme. These sequences are then synthesized into oligonucleotides (oligos) or long DNA fragments to allow long-term storage. To retrieve the data, a DNA sequencing method is applied to obtain the original ATCG sequence from the synthesized DNA.

Summarizing the findings of earlier studies, an optimal coding scheme should achieve three main features:
1) High fidelity: during data retrieval, there is a trade-off between accuracy and redundancy. While additional redundancy helps to improve accuracy, it also increases data size. Hence, to strike a balance, appropriate coding schemes and error correction strategies are applied to avoid and rectify errors induced during DNA synthesis or sequencing.
2) High coding efficiency: with four elementary bases, DNA has the theoretical coding potential to store at least twice as much information in quaternary scaffolds as binary codes.
3) Flexible accessibility: from a computer science standpoint, stored data is expected to support random access. Lack of random access hampers attempts to scale up the data size, because it is impractical to sequence and decode the whole dataset each time only a small amount of data needs to be retrieved.

Correspondingly, proposed coding schemes are usually designed to fulfill all of the above characteristics. Generally, DNA-based data storage coding schemes can be differentiated by their binary transcoding methods (Fig. 1), or by the ways in which they add redundancy to increase fidelity (Fig. 2).

and synthesis (e.g., repeated sequences, secondary structure and abnormal GC content) [9]. By employing the free base swap strategy (a 'one-to-two' binary transcoding method, Fig. 1A), Church and colleagues encoded approximately 0.65 MB of data into ~8.8 Mb of DNA oligos, each 159 nucleotides (nt) in length. Given the large amount of digital data that was successfully stored in DNA, this was considered to be a milestone study [15], and it also demonstrated the potential of DNA-based data storage to cope with the challenge of the information explosion. However, to allow its base-swapping flexibility, this coding scheme sacrifices information density by transcoding each binary digit into one base. Later researchers have developed other coding strategies to overcome this issue while maintaining comparable performance.

Huffman coding scheme
Huffman code, developed by David Huffman in the 1950s, is considered to be an optimal prefix code that is commonly used for lossless data compression. In 2013, Goldman and colleagues adopted the Huffman code in their coding scheme, which effectively improved the coding potential to 1.58 bits/nt [12].
Before transcoding into DNA nucleotides, binary data were first converted into ternary Huffman code, and then transcoded to DNA sequences by referring to a rotating encoding table (Fig. 1B). Each byte of the resulting data was substituted by five or six ternary digits (comprising the digits '0', '1', and '2' only) by Huffman's algorithm [16]. Encoding in this way, as per the rotating table, eliminates the generation of mononucleotide repeats and can compress the original data by 25-37.5%. For ASCII (American Standard Code for Information Interchange) text format files, this type of compression performs even better because the most common characters are mapped to five-digit ternary strings [12]. However, the transcoding algorithm cannot prevent abnormal GC distribution when dealing with certain binary patterns. In addition, this coding scheme employs simple parity check coding to detect errors, and maintains a four-fold coverage redundancy to prevent error and data loss (Fig. 2A). However, while the simple parity check coding can detect errors, it cannot correct them. Moreover, increased redundancy inevitably lowers the coding efficiency. Although not perfect, this work not only improved coding efficiency and prevented nucleotide homopolymers, but also introduced a strategy to ensure fidelity by adding redundancy.

encoding principle [13], using an XOR (⊕) operation to yield redundancy. As shown in Fig. 2B, every two original sequences, A and B, generate a redundant sequence C by A⊕B. Therefore, with any two sequences (AB, AC or BC), one can easily recover the third sequence. This coding scheme also provides the flexibility to adjust redundancy according to the level of significance of particular data strands, namely 'tunable redundancy'. It reduced the redundancy relative to the original data from three-fold to one half, providing an efficient way to ensure fidelity.
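The XOR redundancy described above can be illustrated with a minimal sketch. It operates on raw byte payloads rather than on nucleotide strands, and the function name `xor_strands` and the example payloads are hypothetical; it shows only the A⊕B principle, not the full scheme of [13].

```python
def xor_strands(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two original payloads (illustrative values only).
payload_a = b"DNA-"
payload_b = b"data"

# The redundant payload C = A XOR B.
payload_c = xor_strands(payload_a, payload_b)

# Any two of the three payloads recover the third.
recovered_a = xor_strands(payload_b, payload_c)
recovered_b = xor_strands(payload_a, payload_c)
```

Because XOR is its own inverse, losing any one of A, B or C is tolerated, at the cost of one extra strand for every two data strands.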
In practice, this coding scheme successfully encoded four files with a total size of 151 KB, and recovered three out of four files without manual intervention [13].

The need to amplify target files in a large-scale database suggests a necessity for random

improving potential data density to ~1.78 bits/nt. With the two-byte (8×2 bits) fundamental information block, this coding scheme introduced a finite field (Galois field; GF) of DNA nucleotide triplets as its elements (Fig. 1C). To prevent mononucleotide repeats of greater than 3 nt during encoding, the last two nucleotides of the triplet are varied, which can give 48 different triplets. A GF of 47 was used because 47 is the largest prime number smaller than 48. The information block is then mapped to three elements in GF(47), i.e., 256² to 47³. The Reed-Solomon (RS) code is applied in this scheme to detect and correct errors. As shown in Fig. 2C, two rounds of RS coding are applied horizontally and vertically, respectively, to the matrix generated by GF transcoding.

In this pilot study, 83 KB of text data were encoded in silico [17]. Although the data size was not impressive, the study underlined the necessity of applying error-correction coding, and significantly enhanced coding efficiency. Moreover, error-correction code from the information communication field was applied to DNA-based data storage for the first time.

Blawat and colleagues proposed a coding scheme specifically to tackle the errors generated during DNA sequencing, amplification and synthesis (e.g., insertion, deletion and substitution) [18]. The potential coding density was 1.6 bits/nt. Two reference coding tables are specified in advance. A one-byte (8 bits) fundamental information block is assigned to a 5-nt DNA sequence, and the third and fourth nucleotides are swapped (Fig. 1D).
Two other criteria are also applied to prevent mononucleotide repeats during this process: 1) the first three nucleotides should not be the same; and 2) the last two nucleotides should not be the same.

can then be mapped to DNA blocks A and B as required, e.g., alternately mapped to A or B.

In this study, 22 Mb of data was successfully encoded and stored in an oligo pool. Those data were retrieved without error, thereby proving the feasibility of the 'forward error correction' coding scheme. However, this was not the case for detecting and correcting single mutations. For example, '11100011' could be mapped to a DNA block 'TGTAG', but if an A-to-T transversion occurs, the DNA block will be changed to 'TGTTG', which will give an erroneous byte '11101111' after decoding.

is a widespread method of coding information in communication systems, and is well known for its robustness and high efficiency [20]. Fountain code is also known as a rateless erasure code, in which the data to be stored are divided into k segments, namely resource packets. A potentially limitless number of encoded packets can be derived from these resource packets. When n (n > k) encoded packets are received, the original resource data can be perfectly recovered. In practice, n only needs to be slightly larger than k to yield greater coding efficiency and robustness for information communication [21].

Transcoding of binary data into nucleotide sequences is also carried out. A fundamental two-bit to one-nucleotide transcoding table is adopted, in which [00, 01, 10, 11] is mapped to [A, C, G, T], respectively (Fig. 1A). Firstly, the original binary information is segmented into small blocks. These blocks are chosen according to a pre-designed pseudorandom sequence of numbers. A new data block is then created by the bitwise addition of the selected blocks, with random seeds attached, and transcoded to nucleotide blocks according to the transcoding table.
Mononucleotide repeats and abnormal GC content are prevented by a final verification step (Fig. 2D) [19].

The oligos in this coding scheme are correlated and have a grid-like topology to realize extremely low but necessary redundancy. This study raised the theoretical coding potential to an unprecedented 1.98 bits/nt, and remarkably reduced the redundancy required for error-free recovery of the source file. Moreover, the mechanism of random selection and validity verification ensures that long single-nucleotide homopolymers do not appear in the encoded sequence. However, in this coding scheme, the complexity of encoding and decoding does not scale linearly with the data size. Thus, decoding can be complicated and may require more resources and a longer computation time. In addition, although the report claims that a 4% loss of total packets would not affect the recovery of the original file, given the features of DNA fountain code, the loss of more packets may cause complete failure of recovery. If the ultimate aim is to permanently store the data, the amount of redundancy must be increased to ensure information integrity.

If we consider DNA-based data storage solely as an archiving process with high fidelity, then DNA fountain coding appears to be the only communication-based coding scheme. In DNA-based data storage and retrieval, the most common error is caused by a single nucleotide mutation. To address this issue, most coding schemes create high redundancy to tackle the challenging conditions of current communication channels. However, these error correction algorithms require complex decoding procedures and large amounts of computing resources. Here, the use of a fountain coding scheme shows for the first time that it is unnecessary to employ error detection/correction algorithms, and this provides us with an alternative solution for improving the performance of DNA coding.
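The two-bit transcoding table and the final verification step described above for the fountain-based scheme can be sketched as follows. This is a minimal illustration only: it does not implement the fountain (packet selection) step itself, and the function names and the screening thresholds (`max_run`, `gc_low`, `gc_high`) are assumptions chosen for the example rather than values taken from the text.

```python
# Two-bit to one-nucleotide table from the text: [00, 01, 10, 11] -> [A, C, G, T].
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Transcode bytes to nucleotides at 2 bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4 nt."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def passes_screen(seq: str, max_run: int = 3,
                  gc_low: float = 0.45, gc_high: float = 0.55) -> bool:
    """Reject sequences with long homopolymers or GC content outside a window.

    The thresholds are illustrative assumptions, not values from the text.
    """
    if not seq:
        return False
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_low <= gc_fraction <= gc_high


candidate = bytes_to_dna(b"\x1b")  # bit pattern 00 01 10 11
is_valid = passes_screen(candidate)
```

In a fountain-style encoder, a candidate packet that fails such a screen would simply be discarded and a new packet generated, which is why the screen costs little coding potential.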

In vivo DNA-based data storage
In vivo DNA-based data storage was commonly adopted in pioneering DNA-based data storage work, such as the Microvenus project, which used bacteria as the storage medium [2]. In the 2000s, other research teams also proposed simple techniques for in vivo DNA-based data storage, e.g., the use of codon triplets to encode alphabets [22] or bits [23] by either transferring plasmids or introducing site-directed mutagenesis. Typically, encoded DNA sequences are first cloned into a plasmid and then transferred into bacteria. Therefore, the DNA sequences, and the information they carry, can be maintained in tiny bacteria and their billions of descendants.

Nevertheless, the capacity of bacteria for carrying plasmids is limited by the type and size of the plasmid. In addition, plasmid mutation is quite common in bacteria. During bacterial replication, taking Escherichia coli as an example, the spontaneous mutation rate is 2.2 × 10⁻¹⁰ mutations per nucleotide per generation, or 1.0 × 10⁻³ mutations per genome per generation [24], with a generation time of 20-30 minutes, which, after a few years, might ultimately alter the information stored.

Hence, the index should be as short as possible to save the information capacity in each oligo. Evidently, many more indices will be needed if more DNA oligo sequences are generated and mixed. However, similar to in vivo DNA-based data storage, a larger data size demands more DNA oligos for in vitro DNA-based data storage. This increases the size of the index in each oligo and thus lowers the storage capacity and efficiency.

To overcome these problems, longer DNA fragments can be used instead of DNA oligos. In 2017, Yazdi et al. successfully encoded 3,633 bytes of information (two images) into 17 DNA fragments, and recovered the images using homopolymer error correction [28].
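The relationship between file size, oligo length and index length discussed above can be explored with a simplified model. The sketch below assumes an ideal 2 bits/nt code, no primers or error-correction overhead, and an index that may use every possible nucleotide combination; these are illustrative assumptions, and the function names are hypothetical.

```python
import math


def min_index_length(file_bytes: int, oligo_nt: int, bits_per_nt: float = 2.0) -> int:
    """Smallest index length (in nt) that can address every oligo of a file.

    Assumes each oligo is split into an index region and a payload region.
    """
    for index_nt in range(1, oligo_nt):
        payload_nt = oligo_nt - index_nt
        oligos_needed = math.ceil(file_bytes * 8 / (payload_nt * bits_per_nt))
        if 4 ** index_nt >= oligos_needed:
            return index_nt
    raise ValueError("oligo too short to index this file")


def net_bits_per_nt(file_bytes: int, oligo_nt: int, bits_per_nt: float = 2.0) -> float:
    """Coding density after subtracting the index region."""
    index_nt = min_index_length(file_bytes, oligo_nt, bits_per_nt)
    return bits_per_nt * (oligo_nt - index_nt) / oligo_nt


# Illustrative comparison for a hypothetical 1 GB file.
for length in (100, 200, 1000):
    print(length, min_index_length(10**9, length), round(net_bits_per_nt(10**9, length), 3))
```

Under these assumptions, longer oligos need fewer addresses and spend a smaller fraction of each strand on the index, which is the motivation for moving from short oligos to longer fragments.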
Nevertheless, the current cost of DNA fragment synthesis is higher than that of oligo synthesis, which increases the overall cost of DNA fragment-based storage.

Overall, both in vivo and in vitro strategies have been employed in current DNA-based data storage research. However, the nature of these two strategies implies the use of different techniques and different application scenarios (Table 1). Although in vivo storage is a more complicated procedure than oligo pool synthesis, in terms of backup cost, in vivo DNA-based data storage is more cost-effective. The cost of the in vitro method has been reduced with the development of array-based oligo synthesis and high-throughput sequencing. Considering long-term storage, DNA in an in vivo condition will degrade more slowly than in vitro. Nevertheless, errors induced by mutations during replication in vivo are more

Other pioneering work goes beyond the aforementioned DNA-based data storage system.

datasets. Therefore, studies during that time were only conducted as a proof-of-concept on a relatively small scale [2].

The concept of massively parallel sequencing (or next-generation sequencing; NGS), a high-throughput sequencing method, was proposed in 2000 [38]. In the following years, sequencing by ligation and by synthesis became major players in the sequencing field. Multiple NGS platforms became commercially available (e.g. 454, Solexa, Complete

emerging technique also comes with limitations. Most NGS platforms require in vitro template amplification with primers to generate a complex template library for sequencing. During this process, copying errors, sequence-dependent biases (for example, in high-GC and low-GC regions and at long mononucleotide repeats) and information loss (for example, methylation) are produced [9].
In 2012, Church and colleagues successfully demonstrated the first application of high-throughput DNA synthesis and NGS in DNA-based data storage [9]. This initiated the rapid development of coding schemes incorporating NGS. The two most common goals at this stage were to improve coding efficiency and to correct sequencing errors.

optimal robustness and high efficiency [28]. This study implies a possible shift from NGS to single-molecule sequencing because of its potential for compact and stand-alone DNA data storage systems [13,30]. Table 2 summarizes the frequently used sequencing platforms in DNA-based data storage. Recently, Oxford Nanopore Technologies announced plans to develop a 'DNA writing' technique using their nanopore technology. Using the same platform to both read and write, they claim it will be possible to selectively modify native bases and stimulate localized reactions, such as light pulses for encoding, which will provide real-time read and write capabilities for DNA-based data storage [42].

In 2018, Oxford Nanopore also launched a high-throughput sequencing platform, PromethION, stating that it has the potential to yield up to 20 Tb of data in 48 hours [43,44]. The first metagenomics data published using the PromethION demonstrated that it is already possible to obtain 150 Gb of data from two flowcells in a 64-hour run [45]. Further developments and improvements are in progress. Since the performance of this technology is getting closer to that of its NGS counterparts, it may play a more prominent role in the future study of DNA-based data storage.

Taken together, DNA-based data storage techniques offer a great opportunity to use DNA as a carbon-based archive with excellent storage density and stability. Imperfect as it is, it may become the ultimate solution for long-term archiving in the current data storage market.
We are also excited to see that multidisciplinary research companies have already joined this revolution to make DNA-based archiving commercially viable.

In terms of coding schemes, although the current theoretical limit of bit-base transcoding is 2 bits/base, newly discovered unnatural nucleic acids could expand the choice of bases for transcoding, and thus increase the theoretical limit. X and Y are two classical unnatural nucleic acids that demonstrated the capability to be integrated into normal cells, and in pairing,

Hence, we could obtain the correlation between an optimal index length and DNA oligo length. As Figure 4 shows, as DNA oligo length increases, the index length decreases, while net coding efficiency increases. Some startup companies are now reportedly aiming to develop industrial enzymatic DNA synthesis technology. If they can successfully synthesize oligos greater than 200-mers, the efficiency of DNA-based data storage will markedly improve.

In addition, the scale of DNA synthesis also affects the information capacity of DNA-based data storage per unit mass. With the development of array-based DNA synthesis technology, high-throughput oligo synthesis is currently directed to the microscale level. In DNA-based data storage, the information capacity of a certain mass of DNA sequences also relates to the copy number of each DNA molecule. The correlation between information capacity C and copy number N_m of each oligo can be calculated from:

$$C = \frac{n}{N_m \times \mu \times \delta \times \gamma} \qquad (2)$$

where n represents the number of bytes carried by each oligo (normally 10-20 bytes/molecule according to different coding schemes); μ is the number of nucleotides per molecule; δ is 320 Dalton/nucleotide; and γ is 1.67 × 10⁻²⁴ g/Dalton. To date, the copy number of oligos is around 10⁷ molecules in on-chip high-throughput synthesis (without dilution) [19]. According to Equation 2, this will give an information capacity level of ~10¹³ bytes/g.
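The order of magnitude quoted from Equation 2 can be checked numerically. The sketch below assumes the relation C = n / (N_m × μ × δ × γ), which follows from the symbol definitions given in the text, and uses illustrative values of n = 20 bytes per oligo and μ = 200 nt per oligo; these two values and the function name are assumptions for the example.

```python
DALTON_PER_NT = 320.0        # delta: average mass per nucleotide, from the text
GRAM_PER_DALTON = 1.67e-24   # gamma: grams per Dalton, from the text


def capacity_bytes_per_gram(bytes_per_oligo: float, nt_per_oligo: int,
                            copy_number: float) -> float:
    """Information capacity C (bytes/g) for a given copy number per oligo.

    Assumed form: C = n / (N_m * mu * delta * gamma).
    """
    mass_per_oligo_species = copy_number * nt_per_oligo * DALTON_PER_NT * GRAM_PER_DALTON
    return bytes_per_oligo / mass_per_oligo_species


# Illustrative values: n = 20 bytes, mu = 200 nt, N_m = 1e7 copies per oligo.
capacity = capacity_bytes_per_gram(bytes_per_oligo=20, nt_per_oligo=200, copy_number=1e7)
print(f"{capacity:.2e} bytes/g")
```

Because C is inversely proportional to the copy number, lowering the number of copies synthesized per oligo raises the capacity per gram by the same factor.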
If the copy number is decreased to 10⁴ molecules per oligo, the information capacity will increase to ~10¹⁶ bytes/g. Additionally, synthesis at the microscale will also reduce the cost by several orders of magnitude and save the dilution step.

Apart from companies with biological backgrounds, information technology (IT)-based industries are also playing an important role in this revolution. As the coding schemes used in DNA-based data storage must still be improved to yield higher coding efficiency and fidelity, efforts from the IT field could be of critical importance. For example, from random access data retrieval to scaling up data storage [13], Microsoft successfully implemented its IT philosophy in DNA-based data storage and is marching steadily towards its goal announced in 2017: a proto-commercial system in three years to store some amount of data on DNA [48]. A recent paper written in collaboration with a scientist from the University of Washington described an automated end-to-end DNA-based data storage device, in which 5 bytes of data were automatically processed by the write, store, and read cycle [48]. Further efforts to speed up the coding and decoding process for daily storage applications are still essential.

We expect more entities and research organizations to join this cohort to eventually make carbon-based archiving a reality, and, further, to attain immediate access storage (IAS) or biological computation. Nevertheless, it remains a priority to maintain a safe and ethical framework for the development of DNA-based data storage. Since DNA is the basic building block of genetic information for living organisms, situations might arise in which synthesized sequences are introduced into living host organisms, and this could lead to biological incompatibility caused by unknown toxicity or other growth stresses. Hence, it is necessary to evaluate the safety of sequences prior to their synthesis.
We long to see the day when the safety, capacity and reliability of DNA mean it will become the next-generation digital information storage medium of choice.

A) One binary bit is mapped to two optional bases [9]. Two binary bits are mapped to one fixed base [11]. B) Eight binary bits are transcoded through Huffman coding and then transcoded to 5 or 6 bases [12]. C) Two bytes (16 binary bits) are mapped to 9 bases [13]. D) Eight binary bits are mapped to 5 bases [14].