Scientists' effort to piece together the genome is taking a significant step forward with a new computerized method that creates more complete and detailed versions of the complex puzzle of life than have ever been produced before.
"We hope and expect this advance will change how new genomes will be sequenced and studied since it gives such an improved view of what is really there," said Michael Schatz, Bloomberg Distinguished Associate Professor of computer science and biology within Johns Hopkins University's Whiting School of Engineering and Krieger School of Arts and Sciences. Schatz coordinated a group of 17 scientists from nine institutions in publishing their results in the current issue of Nature Methods.
"Without this approach, you will simply miss a lot of important gene sequences, and many errors can be introduced," said Schatz, who worked closely with researchers at Pacific Biosciences in Menlo Park, California, and Cold Spring Harbor Laboratory in Cold Spring Harbor, New York. He was joined by Johns Hopkins colleague Fritz J. Sedlazeck, a postdoctoral researcher.
Also involved in the research were scientists from the U.S. Department of Energy Joint Genome Institute; the Salk Institute for Biological Studies; University of California, Davis; University of Nevada, Reno; and Università degli Studi di Verona in Verona, Italy.
The development of the two algorithms, FALCON and FALCON-Unzip (which are available free to the public), Schatz said, is analogous to the move from a primitive telescope "that can only see the closest, brightest objects in the sky, to the Hubble space telescope that can dramatically improve the resolution to see things that are much more distant and in much greater focus."
The improvement from previous methods could have a significant impact in biology and medicine, as "genome assembly is one of the most fundamental and important steps in molecular biology to study the genetics of any living thing," he said.
Since the 1970s, genome sequencing, the process of determining the complete DNA map of an organism at one time, has produced the life codes for a number of microorganisms, plants, and animals, including humans. While many of these have been touted as the "whole genome," most of them, including the human genome, are not complete. In most published genomes, big pieces of the picture have been left out.
In producing more detailed genomes of three important species, including the Cabernet Sauvignon red wine grape, the researchers who worked on the new paper show how their approach improves on previous methods. Specifically, most other approaches, Schatz said, "would completely skip the fact that our genome and many genomes are actually 'diploid' and have two copies of each chromosome—one from mom and one from dad. Those two copies can be very different from each other, including genes or mutations that you only inherit from your mother or from your father."
In all three species studied for this paper—Cabernet Sauvignon, a widely studied flowering plant called Arabidopsis thaliana, and a coral fungus—the scientists found large segments of DNA that were specific to one of the two copies.
"An analogy of this might be that previous methods for sequencing genomes would give you a black-and-white representation—'haploid,' just 1 copy of each chromosome," Schatz said. "But our new software gives you a full-color representation allowing you to see all the details hidden in the shadows."
The greater detail and accuracy are partly due to the fact that the algorithms produce fewer and bigger puzzle pieces, that is, longer stretches of the DNA sequence, which is made up of four chemical compounds known as nucleotides: adenine, cytosine, guanine, and thymine. Larger sections mean fewer gaps and greater precision in understanding what the sequence means.
In the grape genome, for instance, previous methods left the genome shattered into up to 12.8 million pieces averaging only 1,000 nucleotides long. The new approach for the Cabernet Sauvignon grape produces 718 contiguous pieces averaging 2.1 million nucleotides long.
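To put the grape figures above in proportion, the short Python sketch below works out the fold change in piece count and piece length. It uses only the numbers quoted in this article; the variable names are illustrative and are not taken from the FALCON software.

```python
# Figures as quoted in the article for the Cabernet Sauvignon grape genome.
previous_pieces = 12_800_000   # "up to 12.8 million pieces"
previous_avg_len = 1_000       # nucleotides per piece, previous methods
new_pieces = 718               # contiguous pieces from the new approach
new_avg_len = 2_100_000        # nucleotides per piece, new approach

# How many times fewer pieces, and how many times longer each piece is.
piece_reduction = previous_pieces / new_pieces
length_gain = new_avg_len / previous_avg_len

print(f"About {piece_reduction:,.0f} times fewer pieces")
print(f"Pieces about {length_gain:,.0f} times longer")
```

On these figures, the new assembly has roughly 18,000 times fewer pieces, each about 2,100 times longer.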
One technical challenge of producing the longer segments was the error rate of 10 to 15 percent in the single-molecule sequencing technology used. However, using sophisticated statistical and computational filtering, a kind of corrective lens for DNA sequencing, the system reduces that to one error every 10,000 to 100,000 nucleotides on average, an error rate of only 0.01 to 0.001 percent.
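The conversion between "one error every N nucleotides" and a percentage can be checked with a few lines of Python. The sketch below also expresses each rate as a Phred-style quality score, Q = -10 log10(p), a common convention in sequencing that the article itself does not mention. The function names are illustrative only.

```python
import math

def rate_from_spacing(one_error_every: int) -> float:
    """Per-nucleotide error rate, given one error every N nucleotides."""
    return 1.0 / one_error_every

def as_percent(rate: float) -> float:
    """Express a per-nucleotide error rate as a percentage."""
    return rate * 100.0

def phred_quality(rate: float) -> float:
    """Phred-style quality score for a given error rate (assumed convention)."""
    return -10.0 * math.log10(rate)

# Raw single-molecule reads: 10 to 15 percent error, as quoted in the article.
for raw_rate in (0.10, 0.15):
    print(f"raw {as_percent(raw_rate):.0f}% -> Q{phred_quality(raw_rate):.1f}")

# After correction: one error every 10,000 to 100,000 nucleotides.
for spacing in (10_000, 100_000):
    rate = rate_from_spacing(spacing)
    print(f"1 in {spacing:,} -> {as_percent(rate):.3f}% -> Q{phred_quality(rate):.0f}")
```

One error in 10,000 nucleotides is 0.01 percent and one in 100,000 is 0.001 percent, matching the figures quoted above.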
Another group of scientists used versions of the software earlier this year to assemble and study the gorilla genome. Schatz said his lab is using the method to study plants, animals, and microorganisms that cause disease, as well as healthy and diseased human genomes, including studies of cancer.