The recent announcement by George Church and co-workers at Harvard University of plans to bring mammoths 'back from extinction' — more properly, to create elephants with mammoth features by splicing their genomes with fragments of those preserved in mammoth remains — has provoked some scepticism. But that optimistic scheme does at least remind us how good DNA is at preserving information over very long times.

That's worth remembering if the notion of storing computer information in the chemical structure of a 'fragile' organic molecule seems risky. Church has also been one of the leading proponents of efforts to do just that: to use DNA as a storage medium for information technology. Given that more than 20 years have passed since the notion of DNA computing was introduced1, and that the capacity of DNA for encoding digital information has in fact been apparent ever since Watson and Crick's discovery of the double helix in 1953, it might seem surprising that it has taken so long for DNA storage to be seriously contemplated. What has made the difference is the rapid evolution over the past two decades of technologies for making long stretches of DNA to precise specifications at commercially viable cost, and likewise for reading out the information. Those techniques were not, of course, developed with data storage in mind, but to support gene manipulation and sequencing for biotechnology and genomics. They are now augmented with tools for pinpoint editing of DNA sequences using the CRISPR/Cas9 system, giving would-be DNA data engineers a complete set of affordable tools for input, editing and readout.

The very real potential of DNA as a storage medium has been amply demonstrated. In 2012 Church et al. stored and read out a 5.27-megabit file (a book) using microchips and advanced sequencing2; Goldman et al. subsequently showed how the approach might be scaled up while retaining 100% accuracy3. CRISPR/Cas9 has been used to find and edit specific stored strings in a DNA random-access memory4. All the same, encoding arbitrary data remains compromised by several factors. Whereas in principle each nucleotide (A, T, G and C) can encode two bits (00, 01, 10, 11), this coding capacity can't be reached in practice. For one thing, some sequences, such as those with high guanine-cytosine content, are hard to synthesize and read out accurately. Furthermore, variability in the synthesis and stability of some oligonucleotide sequences makes them unavailable for coding. These constraints limit the actual storage capacity to about 1.8 bits per nucleotide, and previous studies have attained no more than 60% of even this reduced limit.
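
To make those constraints concrete, here is a minimal Python sketch of the naive two-bits-per-nucleotide encoding together with a screen that rejects hard-to-handle sequences. The GC-content window and maximum homopolymer run length are illustrative assumptions, not thresholds from any of the studies cited; the point is simply that rejecting a fraction of all possible sequences is what pushes the usable capacity below two bits per nucleotide.

```python
# Naive 2-bits-per-base encoding plus a biochemical screen.
# All thresholds below are illustrative assumptions only.

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bits_to_dna(bits: str) -> str:
    """Encode a bit string (even length) as DNA, 2 bits per nucleotide."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def is_synthesizable(seq: str, gc_lo: float = 0.45, gc_hi: float = 0.55,
                     max_run: int = 3) -> bool:
    """Reject sequences with extreme GC content or long single-base
    (homopolymer) runs, the kinds of strings that synthesis and
    sequencing handle poorly."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if not gc_lo <= gc <= gc_hi:
        return False
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if prev == cur else 1
        if run > max_run:
            return False
    return True

print(bits_to_dna("10101010"))       # 'GGGG': a homopolymer with GC = 1.0
print(is_synthesizable("GGGG"))      # False: fails both tests
print(is_synthesizable("ACGTTACG"))  # True: GC = 0.5, longest run = 2
```

Every rejected string is one fewer usable codeword, which is why the information carried per nucleotide falls below the naive two bits.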

Erlich and Zielinski have now drawn on a coding strategy well known in computer science, called a fountain code, to improve this performance5. The data set is divided into short, non-overlapping segments, and random subsets of these are combined into message packets called droplets; any droplet whose DNA sequence would be error-prone is screened out and replaced. Each droplet is tagged with the random-number 'barcode' that generated it, allowing the original segments to be reassembled on readout.
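
The core of a fountain code is the Luby transform: each droplet is the bitwise XOR of a pseudo-randomly chosen subset of segments, and the seed that drove the choice travels with the droplet as its barcode. The sketch below illustrates the idea under stated simplifications: a uniform degree distribution stands in for the carefully tuned soliton distribution of real fountain codes, and `screen` is a placeholder for a biochemical filter like `is_synthesizable` above; names and parameters are illustrative, not Erlich and Zielinski's implementation.

```python
import random

def make_droplet(segments: list[bytes], seed: int, max_degree: int = 4):
    """Luby-transform step: XOR a pseudo-random subset of equal-length
    segments. The seed fully determines the subset, so it doubles as
    the droplet's 'barcode'. (A uniform degree distribution is a
    simplification of the soliton distribution used in practice.)"""
    rng = random.Random(seed)
    degree = rng.randint(1, min(max_degree, len(segments)))
    chosen = rng.sample(range(len(segments)), degree)
    payload = bytes(len(segments[0]))  # all-zero start
    for i in chosen:
        payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
    return seed, payload

def droplet_stream(segments, screen):
    """Yield an endless stream of droplets, silently skipping any that
    fail the biochemical screen (GC content, homopolymer runs and so
    on). Error-prone sequences are thus eliminated without losing
    information: the fountain just keeps pouring."""
    seed = 0
    while True:
        seed += 1
        barcode, payload = make_droplet(segments, seed)
        if screen(payload):  # in practice the screen inspects the DNA encoding
            yield barcode, payload
```

Because any sufficiently large collection of distinct droplets suffices to reconstruct the file, discarding awkward sequences costs a little redundancy rather than any data.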

In this way Erlich and Zielinski could encode various data files (a computer operating system, an early motion picture, a computer virus and others), and not only read them back with perfect fidelity but also create error-free copies. The fountain algorithm achieves a capacity per nucleotide of 86% of the theoretical maximum, or roughly 1.55 bits.
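
Reading the data back means inverting the Luby transform. A standard way to do this, sketched here to match the toy encoder above (as an illustration of the principle rather than the authors' actual implementation), is a 'peeling' decoder: find a droplet that depends on only one unresolved segment, solve it, subtract it from every other droplet, and repeat.

```python
import random

def decode(droplets, n_segments: int, seg_len: int, max_degree: int = 4):
    """Peeling decoder for the toy encoder above. Each droplet's seed
    is replayed to recover which segments were XORed into it; degree-1
    droplets then reveal the segments one by one."""
    pending = []
    for seed, payload in droplets:
        rng = random.Random(seed)  # replay the encoder's PRNG choices
        degree = rng.randint(1, min(max_degree, n_segments))
        chosen = set(rng.sample(range(n_segments), degree))
        pending.append((chosen, bytearray(payload)))

    solved = {}
    progress = True
    while progress and len(solved) < n_segments:
        progress = False
        for chosen, payload in pending:
            for i in [i for i in chosen if i in solved]:
                chosen.discard(i)          # XOR out already-known segments
                for k in range(seg_len):
                    payload[k] ^= solved[i][k]
            if len(chosen) == 1:           # one unknown left: segment solved
                i = chosen.pop()
                if i not in solved:
                    solved[i] = bytes(payload)
                    progress = True
    return [solved[i] for i in range(n_segments)]
```

Any sufficiently large subset of distinct droplets will do, which is why copies of copies can still be read back without error.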

This won't yet make DNA data storage a commercial reality: it is still far too costly (about US$3,500 per MB here). Yet not only are these costs still falling, but schemes like this can also tolerate cheaper, less accurate synthesis than the methods developed for genomic technologies.