Illustration: a scientist holding a box of tomatoes. Image credit: Davide Bonazzi

Genomics

The code breakers

From tomatoes to cancer cells, Michael Schatz and others at Johns Hopkins go deep inside genomes to unlock the secrets to life's variety

You probably missed it, but 2012 was a momentous year for the tomato.

That spring, the Tomato Genome Consortium announced that it had, for the first time, sequenced an entire tomato genome. The undertaking required a massive and complex effort, one that involved more than 300 scientists from 14 countries working collaboratively for almost eight years. They saw it as a critical first step to decoding the many characteristics of one of the world's most popular and economically dynamic fruits—its worldwide market estimated at $190 billion—and applying that knowledge to improving the quality and adaptability of other plant species. Juicy, drought-resistant tomatoes coming to a grocery store near you.

Last year, a handful of researchers co-led by Johns Hopkins computer science and biology Professor Michael Schatz completed another tomato sequencing project, albeit with a reshaped objective: decode the genomes of 100 different varieties of tomatoes. And they did it, as planned, in 100 days.

That's how dramatically the technology of gene sequencing has advanced in less than a decade.

"It's a totally new world for genomics," Schatz says. "Just the scale, the efficiency, the complexity that's possible now."

Schatz oversaw the 100 varieties project, along with Zach Lippman, a biologist and professor at the Cold Spring Harbor Laboratory in New York. The research, funded by the National Science Foundation, illustrated how increasingly sophisticated software has made it possible to zero in on the genetic components that make a tomato taste better, grow larger, and look delectable.

Illustration: a scientist outlining a human head in blocks. Image credit: Davide Bonazzi

"The real message of this project is that all the secrets are now being exposed," Lippman says. "And exposing those secrets allows you to ask: What can I learn from what's been hidden? And how can that help me further improve or accelerate how I would do breeding, not only in tomatoes but in any crop?"

The tomato was chosen not only for its economic significance but also for what Schatz describes as the "modest complexity" of its genome compared to that of many other plants. That said, a tomato plant's seeming simplicity belies its inner workings.

"The genetic code lays out exactly where a branch will occur, where the fruits will occur," he says. "It's incredibly complicated, but now we have the tools to go in there."

"In there"—whether within a tomato or a human—is a molecular universe that both inspires and confounds scientists like Schatz. He compares the functioning of a cell to the behind-the-scenes crew that makes a theater performance possible. The challenge comes with trying to figure out what's causing what, whether it's a cancerous tumor or early hair loss. "There are so many moving parts," he says.

With that challenge comes great potential to bring unimaginable precision to a wide range of biological interventions, from fine-tuning tomatoes to personalizing treatments for all kinds of medical conditions. The computational tools Schatz uses can make it possible for scientists to reengineer tomatoes that can better cope with climate change, for instance. But they also will likely play a key role in individualizing treatment of the genetic mutations that become cancerous tumors.

"The software," he says, "is universal."

Schatz is an earnest evangelist. He may work in what many would find an abstruse field, but he becomes animated as he explains the appeal of deciphering genomic enigmas. He actually started his career in cybersecurity but ultimately switched from foiling hackers to untangling DNA codes. It was a leap of faith inspired by the Human Genome Project and driven by his sense that it was a way computer scientists could help address truly important matters. He also likes to solve puzzles.

Simply put, genomic sequencing comes down to spotting patterns. Or more specifically, creating algorithms that enable computers to find matches in the maze of DNA sequences. That's where Schatz, a computer whiz who has learned to love biology, has made his mark. He has developed widely adopted methods and software for piecing together and interpreting genetic material, including a key advance in the ability to identify what are known as "long-read sequences." These are longer strands of DNA that can provide more context for analyzing a genome.
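To make the pattern-matching idea concrete, here is a minimal sketch in Python of why longer reads give more context. It is an illustration only, not Schatz's software: the function names `overlap` and `possible_next_pieces`, the toy sequences, and the three-letter minimum overlap are all assumptions made up for this example. It checks which fragments could follow a given fragment by comparing the end of one with the start of another.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def possible_next_pieces(read: str, reads: list[str], min_len: int = 3) -> list[str]:
    """List every other read whose start matches the end of `read`."""
    return [r for r in reads if r != read and overlap(read, r, min_len) > 0]


# Hypothetical toy fragments; both sets contain the repeated stretch "TAT".
short_reads = ["CATAT", "TATCC", "TATGA"]
long_reads = ["GGCATATCC", "ATATCCTAT", "CTATATGAA"]

short_candidates = possible_next_pieces(short_reads[0], short_reads)
long_candidates = possible_next_pieces(long_reads[0], long_reads)

print(short_candidates)  # two candidates: the short piece fits in more than one place
print(long_candidates)   # one candidate: the long piece fits in only one place
```

With the short fragments, the repeated stretch makes two different pieces look like valid neighbors. The longer fragments span the repeat, so only one neighbor remains. Real assemblers also have to cope with sequencing errors and billions of letters, which this sketch ignores.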

"A lot of genomics is like a jigsaw puzzle where you're trying to see which pieces fit together," he says. "Now if you had a puzzle where it was all blue sky and you're trying to piece together all these teeny-tiny pieces, there's a lot of complexity involved in how those pieces fit together. But if it's the same picture with much bigger puzzle pieces, it's easier to see how they lock together."

Much of Schatz's focus is on characterizing genomic variations, whether it's from person to person, tomato to tomato, or microbe to microbe. While most analysis of DNA mutations concentrates on single changes, Schatz's method targets larger mutations called structural variants, where genes and other large segments of the genome are duplicated or rearranged. Most of the time these changes are harmless, but sometimes they can radically affect how the cells grow or function in an organism. Some structural variants are associated with genetic diseases.

"One of the goals of my group has been to use this new technology to see structural variants that haven't been seen before," Schatz says. "With cancer patients, for instance, we're finding mutations and other important risk factors that are completely invisible through the standard way of looking at them."

Those breakthroughs are part of a profound shift in biological research, one that has opened up remarkable levels of molecular exploration.

A major impetus of biology's transformation was the Human Genome Project—the massive, 13-year global research undertaking that identified the roughly 20,000 protein-coding genes in human DNA.

"I thought it was the most exciting thing in all of science," says Steven Salzberg, a Johns Hopkins professor of biomedical engineering, computer science, and biostatistics who participated in the Human Genome Project research in the early 1990s. More than a decade later, Schatz was one of his doctoral students at the University of Maryland.

"With the invention of the microscope, scientists could suddenly see all these little squiggly things that they didn't even know were there before," Salzberg says. "So now, with this genome technology, we can actually read DNA, and RNA as well. We knew it was there, but until the 1950s nobody even knew what it did. And, we couldn't read it very well until the invention of sequencing technology in the 1980s."

Since then, the use of artificial intelligence in genomic science has accelerated, which has made it possible to analyze an escalating amount of genetic data. The same type of technology that has brought us Alexa and self-driving cars is also key to making sense of billions of pieces of DNA code—although instead of programming machines to behave like humans, in this case, the tech is designed to interpret data far beyond human capabilities.

For context on just how much data we're talking about, Schatz says that if converted to text, it would result in a book 10 trillion pages long. In essence, searching for variants, he says, is like looking "for a needle in a stack of needles."

Schatz explains that machine learning is quite effective in identifying gene variants and in finding mistakes in data that suggest variants which aren't real. He says that with the tomato research, artificial intelligence was helpful in recognizing genetic associations that haven't been seen before.

He notes that cross-breeding by tomato processing companies in the past has resulted in fruits that are larger and easier to harvest but have lost other qualities.

"The great thing about tomatoes is that you can find a wild ancestor in the mountains of South America that has amazing flavor and color," he says. In fact, one variety he sampled tasted like a combination of tomato and pineapple, a unique flavor you can't get commercially.

But there is a caveat. "Apparently, the leaves are a bit toxic."

The broader implementation of genomic sequencing, Schatz says, has come in waves, starting with the now more than a million people who have used 23andMe and similar companies to learn not only about their ancestry but also medical conditions and disease risk. Although that data is proprietary and generally not available to researchers, the popularity of these companies has raised the public's awareness of how much DNA analysis can reveal.

A second wave has occurred in hospitals, where sequencing of cancer cells from tumors is becoming more commonplace. But David Valle, director of the Institute of Genetic Medicine at Johns Hopkins, sees that as only a first step in how the technology will be applied to treatment. He believes that, in addition to analyzing patients' tumor cells, doctors will sequence their "constitutional" genome. This is the genome we inherit from our parents, the DNA present in nearly every cell in our bodies.

That would facilitate a comparison of the malignant genome and the nonmalignant genome, which would help pinpoint the cancer's progression. It also could affect how it's treated.
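One way to picture that comparison is as a set operation: variants found in both genomes were inherited, while variants found only in the tumor arose later. The Python sketch below is a simplified illustration under that assumption. The `Variant` record and every position in it are invented for the example, and clinical pipelines do far more work to separate real variants from sequencing errors.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A hypothetical, simplified variant record."""
    chrom: str  # chromosome name
    pos: int    # position on the chromosome
    ref: str    # letter in the reference genome
    alt: str    # letter observed in the sample


# Made-up variants for illustration only.
constitutional = {
    Variant("chr1", 1200, "A", "G"),
    Variant("chr7", 5400, "C", "T"),
}
tumor = {
    Variant("chr1", 1200, "A", "G"),
    Variant("chr7", 5400, "C", "T"),
    Variant("chr17", 880, "G", "A"),
}

tumor_only = tumor - constitutional  # acquired by the tumor
inherited = tumor & constitutional   # shared with every other cell

print(sorted(tumor_only))
print(sorted(inherited))
```

The tumor-only set points to changes that may be driving the cancer. The inherited set is where a variant unrelated to the cancer itself, but relevant to treatment, would show up.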

"Let's say that in the constitutional genome, there's a variant that has an impact on blood clotting," Valle says. "That has nothing to do with the cancer, but it can affect how we decide to treat the cancer."

He notes that innovations in analyzing genomes, such as those Schatz has developed, are a key to the evolution of precision medicine. For all its promise, the domain of DNA can be extremely difficult to decipher. What might appear to be a correlation between a particular gene—or even variants within that gene—and a specific condition or trait, often is not. False positives are not unusual. At best, Valle says, he could make only educated guesses as to the impact of deactivating particular genes, and he concedes that many of those guesses probably wouldn't be accurate.

Gene behavior can be bewildering. Take, for example, Marfan syndrome, a genetic condition that causes a person to have unusually long arms and legs and loose joints. It has been traced to a variant in a particular gene. But research has found that a different variant in that same gene can result in what's known as "stiff skin syndrome," where the person tends to be short and have thick skin and inflexible joints.

"That's important to know from the standpoint of diagnosis and informed treatment," Valle says. "But it also tells us an enormous amount about how biology works. How is it that one set of variants in this gene causes one condition and a different set of variants in the same gene causes a very different condition?"

This brings us back to tomatoes. While Schatz acknowledges that genome sequencing is still largely in the discovery stage, the 100-tomato project enabled the researchers to discover which variants are related to which tomato characteristics.

"Now we understand some of the genetic mechanisms that control the size and weight of the fruit, the flavor. The shape of the leaves. That's the immediate impact," he says. "What we hope is that now that we understand these traits genetically, we'll be able to work with breeders to make more productive fruits, tomatoes that are more flavorful and more nutritious."

At the same time, the computational method Schatz has developed is driving an ambitious search for clues to reveal what within a gene makes a person more likely to have a particular disease or trait. The more data that can be analyzed, the better the chances of zeroing in on a causal relationship.
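Why more data helps can be shown with a toy majority vote. The Python sketch below assumes several reads of the same stretch of DNA, already lined up and of equal length, each with an occasional error. The reads and the `consensus` function are hypothetical, and this is not the method Schatz's group uses. It only shows how a weak signal in any one measurement becomes clearer when measurements are combined.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Return the most common letter at each position across aligned reads."""
    if not reads:
        raise ValueError("at least one read is required")
    if any(len(r) != len(reads[0]) for r in reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) walks the reads column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Four made-up reads of the same region; two contain a single error each.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"]
combined = consensus(reads)
print(combined)  # ACGTAC
```

Any single read here could be wrong at some position, but the errors fall in different places, so the vote across reads recovers the underlying sequence.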

"You can't just trust a single measurement," he says. "It's like the old carpenter's rule—measure twice, cut once. On a single-person, single-sequence level, it might be a very weak signal. But by combining info from multiple sequences, that's where we can become more precise."

The goal is to identify and catalog as comprehensively as possible which variants in genes are directly related to a medical condition. "We can be more predictive," he says. "We can go screen people who may not have signs of a disease, but we can figure out their risk."


Think you know tomatoes? Here are five things you may not have known:

  • A tomato has almost 32,000 genes—more than a human.
  • There are an estimated 15,000 varieties of tomatoes.
  • Most wild tomatoes are the size of a marble.
  • Tomatoes first came from Peru. Spanish explorers took them back to Europe in the 16th century.
  • The first documented tomato was yellow and many-lobed, like a pumpkin.

There's much to be said for the prospect of bringing such a high level of precision and predictability to medical diagnosis and treatment. But it's not without its ethical quandaries. It's one thing to modify a tomato genome to create a tastier fruit, but it's quite another to tinker with a human genome to produce a baby with blue eyes and exceptional language skills.

Alexis Battle, a Johns Hopkins associate professor of biomedical engineering who has worked with Schatz on several research projects, has given that a lot of thought.

"We have to consider the implications of everything we learn about the genome," she says. "If we get really, really good at predicting traits from inherited genetic sequences, where does that ultimately lead? Will it lead to people wanting to modify genomes of future children? And where does that lead?"

Valle raises another issue. "When you sequence a patient's genome and you find out something that's medically relevant, well, they share parts of that genome with their relatives," he says. But the physician may not have a doctor-patient relationship with those other family members. How, he asks, should a doctor interact with relatives who also may have a genetic predisposition to a medical condition? And, he wonders, do people want to know about something doctors can't really treat? "You can never be absolutely sure how someone will respond," he says. "Every patient is different."

For Salzberg, it's important not to confuse the technology with how it might be used.

"Yes, there are ethical consequences in tinkering with human cells," he says. "But the technology doesn't have any ethics. It's like saying a hammer has ethics. You can kill someone with a hammer, but you didn't invent a hammer to do that."

Salzberg acknowledges that a person can use any tool for nefarious purposes and that it's important to implement safeguards to prevent that. "But we're still working on the tools," he says. "We know they have great potential for benefit. That's why we're developing them. So, let's see how accurate we can make the technology. Let's see how far we can get."

Schatz acknowledges the concerns about gene engineering, whether it involves humans or food crops. But he also sees genome sequencing as a method for applying science in a profound way to a process inherent in all living things. He points to the example of a large cornfield. The many mutations occurring in the field are random. The corn ears are various shapes and sizes. Plants respond differently to climate and soil conditions.

"Now we can go in and in a very careful, focused way create mutations that we believe will be beneficial. Our understanding of the genomes has really accelerated in recent years and now we can do this quite successfully."

He adds, "I think our best response is to talk about the precision of the process and explain that making an edit to a genome doesn't turn something into a monster."

Schatz draws an analogy to adding apps to a smartphone. "Adding another app will not break your phone," he says. "In the same way, we have that much control over a genome. We can install new code. And we know what it will do. It's adding new capabilities."

The desire to identify the genetic blueprints for the range of characteristics within one species was a principal objective of the tomato sequencing project. To that end, the research team placed a high priority on diversity when they chose which plants to analyze. Some were commercially bred tomatoes you would find in a grocery store. Others were centuries-old varieties that still grow in the wild in South and Central America.

"We wanted to cast a wide net where we would have as much variability in the fruits as possible," he says. That improves the chances of being able to pinpoint which genes or variants are linked to which tomato traits.

From there it becomes a matter of editing genes to mimic those traits. Unlike genetically modified food, in which genes from other species, such as bacteria, are introduced into a crop to help it resist damage by pests or pathogens, the tomatoes are modified only with genes from other tomatoes, Lippman explains.

"Genome editing is working with the DNA that's already in the plants," he says. "We're not trying to bring in a silver bullet from a nonplant organism to address a problem or increase yield. What we're doing is integrating and bowing down to what we're learning from nature.

"Breeders in the old days would say, 'I'm going to cross a wild species with my elite hybrid tomato to bring in a certain trait,'" Lippman adds. "The problem with that is that you shuffle the deck. The process with which you try to bring in that desirable trait from the wild is very slow, convoluted, and expensive, and it may not bring you back to where you were before you shuffled the deck.

"We're taking what nature's given us and we're turning dials."

Lippman says genome editing could be an important tool in helping crops adapt to climate change. He's looking to sequence more wild ancestors of the tomato that have endured in drought conditions or places that get a lot of rain.

As the desire to learn more about the molecular workings of living things builds, Schatz hopes to be able to keep providing the algorithmic tools that make it possible.

"In some cases, we know what the right genes are, but we're looking at them in the wrong way," he says. "That's where the computer science kicks in to enable us to write software systems that can make sense of what we're seeing.

"Every genome brings its own story. We look at where we can have the most impact in the world."

Randy Rieland is a writer based in Washington, D.C.