Researchers have published the most complete and highest-quality set of reference genomes for 16 vertebrate species, creating a resource that could help scientists address fundamental questions in biology, medicine and biodiversity conservation.
The study, published Wednesday in a special issue of Nature, summarizes the recent efforts of the Vertebrate Genomes Project, or VGP, an international initiative to generate accurate and complete genome assemblies of all living vertebrate species. More than a mere academic exercise, such a collection would prove invaluable for future researchers.
“It's like a cookbook or an instructional manual for every living organism,” said author Sadye Paez, a senior research associate at The Rockefeller University. “By having this cookbook, we can then start to look at, what are the genes that encode their life? So we can then start to ask questions about biology, human health and disease [and] conservation.”
One example of these applications is using bats to study immunity, as was done in an earlier VGP study. Bat immunity is of concern to humans because the animals have been implicated as a likely source of the outbreak that started the COVID-19 pandemic.
“Why are bats so resistant to infectious diseases; what is it about their genome that gives them this resistance that they can harbor these terrible diseases like coronavirus, but they don't get sick from it or they don't die from it?” Paez asked. “What is it about their genome that's different from ours? What can we learn from that?”
However, genomic data for vertebrates is often incomplete or insufficient due to technological barriers in sequencing. This was a problem that Erich Jarvis, lead of the VGP sequencing hub at The Rockefeller University in New York, sought to solve in 2015 when he took a leadership role in the G10K consortium, the international group of scientists that leads the VGP.
“I emphasized the need to work with technology partners and genome assembly experts on approaches that produce the highest-quality data possible, as it was taking months per gene for my students and postdocs to correct gene structure and sequences for their experiments, which was causing errors in our biological studies,” Jarvis said. “For me this was not only a practical mission, but a moral imperative.”
Before the researchers could build a wider library, they had to evaluate which sequencing and assembly approaches would work best. They tested many different methods on a single species, the Anna’s hummingbird.
“We did what we call sort of the kitchen sink approach: Throw everything at an organism to see what does it look like,” said Paez.
They found that the optimal approach involves using automated assembly methods, but adding a step of manual curation in which assembled sequences are checked for completeness and errors. These results could then inform better assembly algorithms.
As a result, errors were caught earlier and kept out of the final genome. These errors included false genes, entire chromosomes that were missing and other regions that older assembly methods failed to capture.
“Previous assemblies were missing between 30% to 50% of GC-rich protein coding gene regulatory regions,” Paez said, referring to important regions of the genome involved in protein production. “They’re sort of called the dark matter of the genome. And so our assemblies have corrected for that.”
Now that the assembly method has been refined, the VGP is putting it to great use. Including the 16 published in the study, the researchers now have 129 genomes submitted to the National Center for Biotechnology Information, a massive, publicly available database hosted by the NIH. Their next goal is to assemble 268 genomes, one from each vertebrate order.
The researchers also hope that some of their methods may be applied to other organisms including invertebrates, plants and microorganisms, though there are unique challenges to genomes of different species that will require further study.
However, vertebrates will provide more than enough data in the meantime. In the long term, the researchers have their sights set on every vertebrate species, a library of some 71,000 genomes. And while the project has taken five years to date, the team is confident that further advances in technology will reduce the time it takes to sequence such a vast number of species.
“The goal is to do all of them within the next 10 years,” Paez said. “There's a couple of things that are helping with that. One, the technology is improving drastically and it keeps improving, so that helps. The second thing is that there's now a lot of other consortium groups that have been modeling after the VGP. I think these two things in [tandem] are going to help us sort of accelerate this goal.”
The study, “Towards complete and error-free genome assemblies of all vertebrate species,” published April 28 in Nature, was authored by Arang Rhie, Sergey Koren, Brian P. Walenz and Adam M. Phillippy, National Human Genome Research Institute; Shane A. McCarthy, Iliana Bista, Dengfeng Guan and Richard Durbin, University of Cambridge; Olivier Fedrigo, Giulio Formenti, Gregory L. Gedman, Lindsey J. Cantin, Bettina Haase, Jacquelyn Mountcastle, Sadye Paez, Matthew T. Biegler, Constantina Theofanopoulou and Erich D. Jarvis, The Rockefeller University; Joana Damas and Harris A. Lewin, University of California, Davis; Marcela Uliano-Silva, Leibniz Institute for Zoo and Wildlife Research; William Chow, Michelle Smith, Milan Malinsky, Zemin Ning, Ying Sims, Joanna Collins, Sarah Pelan, James Torrance, Alan Tracey, Jonathan Wood and Kerstin Howe, Wellcome Sanger Institute; Arkarachai Fungtammasan, Maria Simbirsky and Brett T. Hannigan, DNAnexus Inc.; Juwan Kim, Chul Lee, Byung June Ko and Heebal Kim, Seoul National University; Mark Chaisson and Robel E. Dagnew, University of Southern California; Francoise Thibaud-Nissen, Jinna Hoffman, Patrick Masterson and Karen Clark, National Library of Medicine; Leanne Haggerty, Fergal Martin, Kevin Howe and Paul Flicek, European Molecular Biology Laboratory; Sylke Winkler, Martin Pippel, Ekaterina Osipova and Eugene W. Myers, Max Planck Institute of Molecular Cell Biology and Genetics; Jason Howard, Novogene; Sonja C. Vernes, Max Planck Institute for Psycholinguistics; Tanya M. Lama, University of Massachusetts Cooperative Fish and Wildlife Research Unit; Frank Grutzner, University of Adelaide; Wesley C. Warren, University of Missouri; Christopher N. Balakrishnan, East Carolina University; Dave Burt, University of Queensland; Julia M. George and David F. Clayton, Clemson University; David Iorns, The Genetic Rescue Foundation; Andrew Digby and Daryl Eason, Kākāpō Recovery; Bruce Robertson, University of Otago; Taylor Edwards, University of Arizona Genetics Core; Mark Wilkinson, Natural History Museum; George Turner, Bangor University; Axel Meyer, Andreas F. Kautt, Paolo Franchini and Robert H. S. Kraus, University of Konstanz; H. William Detrich III, Northeastern University; Hannes Svardal, University of Antwerp; Maximilian Wagner, Karl-Franzens University of Graz; Gavin J. P. Naylor, University of Florida; Mark Mooney, Tag.bio; Trevor Pesout, Erik Garrison, Hiram Clawson, Mark Diekhans, Luis Nassar, Benedict Paten, Beth Shapiro and David Haussler, University of California, Santa Cruz; Marlys Houck, Ann Misuraca and Oliver A. Ryder, San Diego Zoo Global; Sarah B. Kingan, Richard Hall, Zev Kronenberg, Ivan Sović, Christopher Dunn and Jonas Korlach, Pacific Biosciences; Alex Hastie and Joyce Lee, Bionano Genomics; Siddarth Selvaraj, Arima Genomics; Richard E. Green and Jay Ghurye, Dovetail Genomics; Nicholas H. Putnam, independent researcher; Ivo Gut, Barcelona Institute of Science and Technology; Sarah E. London, University of Chicago; Claudio V. Mello, Samantha R. Friedrich and Peter V. Lovell, Oregon Health & Science University; Farooq O. Al-Ajli, Monash University Malaysia Genomics Facility; Simona Secomandi, University of Milan; Michael Hiller, LOEWE Centre for Translational Biodiversity Genomics; Yang Zhou and Guojie Zhang, BGI-Shenzhen; Robert S. Harris, Kateryna D. Makova and Paul Medvedev, Pennsylvania State University; Woori Kwak, eGenome Inc.; Andrew J. Crawford, Universidad de los Andes; M. Thomas P. Gilbert, University of Copenhagen; Byrappa Venkatesh, A*STAR; Robert W. Murphy, Royal Ontario Museum; Klaus-Peter Koepfli and Warren E. Johnson, National Zoological Park; Frederica Di Palma, University of East Anglia; Tomas Marques-Bonet, Institute of Evolutionary Biology (UPF-CSIC); Emma C. Teeling, University College Dublin; Tandy Warnow, University of Illinois at Urbana-Champaign; Jennifer Marshall Graves, La Trobe University; and Stephen J. O’Brien, ITMO University.