Yuan Gao greets me with a question: "Do you have a sentence you can give me?"
We are in his office at the Lieber Institute for Brain Development on the Johns Hopkins medical campus. Gao, an investigator at Lieber and an associate professor of biomedical engineering at the Whiting School of Engineering, waits, fingers poised over his computer keyboard. I respond to his request: "How does DNA encoding work?"
Gao nods and types the sentence into the computer. On the monitor, a string of letters appears: CTACACGAGCTCTTCCGATCT and on and on for about 500 letters. If that sequence seems familiar, it is because it represents the string of nucleotides that make up DNA: adenine (A), cytosine (C), guanine (G), and thymine (T). Gao's computer has taken my sentence and converted it to DNA code. Were he then to run that code through a machine known as a DNA synthesizer, he could produce actual strands of synthetic DNA, either in solution or as a powder, that contain my sentence in the sequence of nucleotides. If I ever wanted to retrieve my words, I could simply run the process in reverse, using a DNA sequencer to "read" the code and convert it back to text.
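For readers who want to see the idea in miniature, here is a minimal sketch of that round trip in Python. It assumes a simple mapping of two bits per nucleotide, chosen only for illustration; the function names and the mapping are hypothetical and are not the scheme Gao's software uses.

```python
# Hypothetical mapping of two bits per nucleotide, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def text_to_dna(text):
    """Turn text into UTF-8 bytes, the bytes into bits, and each bit pair into a base."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_text(dna):
    """Reverse the process: bases back to bits, bits to bytes, bytes to text."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


sentence = "How does DNA encoding work?"
encoded = text_to_dna(sentence)
print(encoded)
print(dna_to_text(encoded))
```

Under this toy mapping every byte of text becomes four nucleotides, and decoding recovers the sentence exactly as long as no base is misread.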
The point of using synthetic DNA as a sort of double-helix hard drive is simple: you can store an astounding amount of data in the tiniest amount of space, and that storage will be stable for what amounts to forever. George Church, a geneticist at Harvard University, has calculated that a single gram of DNA could hold 455 billion gigabytes of data and remain stable for millennia. "Theoretically, you could store everything in the universe," Gao says. Plus DNA will always be DNA, which means the data will still be readable thousands of years from now, unlike all the data in obsolete formats that has piled up on cassette tapes, flash drives, and Zip disks. Remember Zip disks? Tried to read one lately?
You will not be able to buy a laptop with a DNA drive anytime soon, but Gao recently contributed to a demonstration of DNA's data-storage potential, led by Church, who formerly was Gao's postdoctoral adviser. Along with Sriram Kosuri, also of Harvard, Church and Gao took a volume co-authored by Church, Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, and converted the book's 53,426 words, one JavaScript program, and 11 images from HTML format to binary code, a long string of 0s and 1s. They then used a computer script written by Church in the Perl programming language to convert those 0s and 1s into DNA code: strings of As, Cs, Gs, and Ts. They sent that code by email to a company, Agilent, which ran it through a synthesizer and produced synthetic DNA that contained several copies of Church's book. Several in this case meaning 70 billion. Quite the press run.
Sequencing always introduces errors—DNA typos, in a sense—which could result, for example, in my retrieved sentence reading, "How does DNB encoding work?" To counter this, the technique developed by Church sequences the same DNA code 3,000 times, then compares the sequences to produce a consensus of the correct code, an ingenious sort of proofreading that virtually eliminates errors. Gao likens it to 3,000 proofreaders reading a text and voting on the correct placement of each letter and punctuation mark. One editor might miss something. Three thousand will not.
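The voting Gao describes can be sketched in a few lines of Python. This is a simplified illustration, not the team's actual pipeline: it assumes every read has the same length, is already aligned, and contains only substituted letters (no missing or extra ones). The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Return the most common base at each position across all reads.

    Assumes the reads are aligned and of equal length (an illustrative
    simplification; real sequencing data needs alignment first).
    """
    return "".join(
        Counter(column).most_common(1)[0][0]  # the winning "vote" at this position
        for column in zip(*reads)
    )


# Five noisy reads of the same strand; two of them contain a single typo.
reads = ["ACGT", "ACGA", "ACGT", "TCGT", "ACGT"]
print(consensus(reads))
```

Each read acts as one proofreader. A typo in one read is outvoted by the others at that position, and with thousands of reads the chance that a majority share the same wrong letter becomes very small.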
After walking me down a hall in the lab to show an example of a DNA sequencer, which bears an uncanny resemblance to a microwave oven attached to one of those mini refrigerators popular in dorm rooms, Gao points out an odd aspect of sequencing—errors can happen anywhere, but they tend to occur near the end of the string of nucleotides. "It's like a human reading," he says. "At the beginning, you're focused. At the end you are tired and distracted. Sequencing is the same way. It's amazing."