Data Deluge and the Human Microbiome Project

Because the cost of genetic sequencing has declined so much, researchers are accumulating oceans of data for no clear purpose. The assumption is that something in the data will stimulate important questions, but this is not an effective way to conduct scientific research.

A specter is haunting science: the specter of data overload. All the powers of the scientific establishment have entered into a holy alliance to exorcise this specter: the National Institutes of Health (NIH), the National Science Foundation (NSF), and the Department of Energy, among others. What funding agency has not called for novel software to distill meaning from a torrent of data – for example, from the 700 megabytes (Mb) of data per second produced by the Large Hadron Collider, the 1,600 gigabytes (Gb) generated each day by NASA’s Solar Dynamics Observatory, the 140 terabytes (Tb) to flow every day from the Large Synoptic Survey Telescope, or the 480 petabytes (Pb) expected daily from the Square Kilometre Array? To deal with “the fast-growing volume of digital data,” the Obama administration on March 29, 2012, announced a Big Data Initiative, with $200 million in new commitments by research agencies and an additional $250 million investment by the Department of Defense to “improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data.”

In the biological sciences, the data deluge is most suffocating. Technological advances make it continually faster and cheaper to produce genomic sequence data than to store, manage, and analyze them. To keep up with the flow of data, some biologists have called for a change in the scientific method, from the traditional practice of posing a hypothesis and then pursuing the phenomena necessary to test it to a novel approach that uses mathematical tools to scan data for interesting associations. Eric Lander of the Broad Institute wrote in Nature magazine that the greatest impact of data-rich genomics “has been the ability to investigate biological phenomena in a comprehensive, unbiased, hypothesis-free manner.” Chris Anderson in Wired magazine wrote an article titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Anne Thessen and David Patterson at the Marine Biological Laboratory in Woods Hole have called for the emergence of a “Big New Biology, focused on aggregating and querying existing data in novel ways.”

The sorcerer’s apprentice

Of the many instances of data overload, the press has given most attention to gene sequencing, which identifies the pattern of base pairs of nucleotides in a DNA fragment. In an article titled “Will Computers Crash Genomics?” Elizabeth Pennisi, writing in Science magazine in February 2011, reported that sequencing centers have produced data sets so large they have to be mailed physically on disks and drives because it could take weeks to transfer them electronically over the Internet. “A single DNA sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project,” Pennisi wrote. She quoted a Canadian bioinformaticist, Lincoln Stein, who predicted that the torrent of DNA data “will swamp our storage systems and crush our computer clusters.”

“The field of genomics is caught in a data deluge,” Andrew Pollack wrote on November 30, 2011, in the New York Times. The story quotes C. Titus Brown, a bioinformatics specialist at Michigan State University. According to Brown, the NIH-sponsored Human Microbiome Project (HMP), which samples and sequences microbial populations found in the human gut and other bodily sites, has already generated about a million times as much sequence data as did the initial Human Genome Project. “It’s not at all clear what you do with that data,” he said.

Eric Green, director of the National Human Genome Research Institute (NHGRI), expressed a similar concern in February 2011 in a videotaped lecture. Green used a picture of a boy trying to drink from a fire hose and the famous image of the great wave by the 19th-century Japanese artist Hokusai to illustrate what he described as a “large onslaught of data sets” and a “massive tsunami” associated with the HMP and other sequencing activities. Green wrote with a coauthor in Nature, “Computational tools are quickly becoming inadequate for analyzing the amount of genomic data that can now be generated, and this mismatch will worsen.” The website of just one of many NIH sequencing initiatives—the 1000 Genomes Project—says that its data set “is currently around 130 Tb in size and growing.”

Although biologists may be uncertain how to respond to the data tsunami, they agree about its cause. A fascinating NIH webpage (http://www.genome.gov/sequencingcosts/) describes the declining “cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality.” Advances in gene sequencing technology, many of which were prompted by NIH, have pushed down the cost from about $1,000 per Mb in January 2008 to about 10 cents today. This dime-a-Mb price includes: “Labor, administration, management, utilities, reagents, and consumables; sequencing instruments … informatics activities directly related to sequence production … ; submission of data to a public database;” as well as “indirect” costs. Who can resist a bargain like that?
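To make the bargain concrete, here is a minimal back-of-the-envelope sketch in Python, using the per-megabase prices quoted above and assuming a human-scale genome of roughly 3,200 Mb (an illustrative round figure, not a number taken from the NIH page):

```python
# Back-of-the-envelope sketch of what the per-megabase price collapse means
# for sequencing one human-scale genome. The genome size (~3,200 Mb) is an
# assumption for illustration; the per-Mb prices are the figures quoted above.

GENOME_SIZE_MB = 3_200          # approximate haploid human genome, in megabases
COST_PER_MB_2008 = 1_000.00     # dollars per megabase, January 2008
COST_PER_MB_NOW = 0.10          # dollars per megabase, "about 10 cents today"

for label, price in [("Jan 2008", COST_PER_MB_2008), ("today", COST_PER_MB_NOW)]:
    print(f"{label}: ~${GENOME_SIZE_MB * price:,.0f} to sequence one genome")
# Prints roughly $3,200,000 for the 2008 price and $320 at today's price.
```

On these assumptions, the cost of sequencing a genome's worth of bases has fallen from millions of dollars to a few hundred in about four years, which is why the sequencers keep running.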

The cost of storage, maintenance, and transfer, however, forms a bottleneck that keeps data from potential users. According to a principle known as Kryder’s Law, the price of data storage falls by half every 14 months. Matthew Dublin at Genome Technology magazine has written, “At present, the per-base cost of sequencing is dropping by about half every five months, and this trend shows no sign of slowing down. … Factor in all of these variables, and the logarithmic graphs start looking like signs of a data-management doomsday.”
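The doomsday arithmetic is easy to sketch. Assuming, as Dublin and Kryder’s Law suggest, that sequencing cost halves roughly every five months while storage cost halves roughly every 14, the relative burden of storage grows quickly; the figures below are purely illustrative:

```python
# Illustrative sketch of the mismatch described above: sequencing cost is
# assumed to halve every 5 months, storage cost (Kryder's Law) every 14.
# Both costs are normalized to 1 at month zero; only the ratio matters.

SEQ_HALF_LIFE = 5.0     # months per halving of sequencing cost (assumed)
STORE_HALF_LIFE = 14.0  # months per halving of storage cost (assumed)

for months in (0, 12, 24, 36, 48, 60):
    seq_cost = 0.5 ** (months / SEQ_HALF_LIFE)      # relative cost to produce a base
    store_cost = 0.5 ** (months / STORE_HALF_LIFE)  # relative cost to store a base
    print(f"after {months:2d} months: producing a base costs {seq_cost:.4f} "
          f"of today's price, storing it {store_cost:.4f}; "
          f"storage is {store_cost / seq_cost:,.0f}x sequencing, relative to today")
```

On these assumptions, after five years storage has become roughly 200 times more expensive relative to sequencing than it is today, which is the divergence Dublin’s logarithmic graphs depict.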

Data-management doomsday appeared to arrive in February 2011 when the National Center for Biotechnology Information (NCBI) announced that, because of budgetary constraints, it would phase out its Sequence Read Archive (SRA) and other database resources. This was remarkable because the HMP, in its Data Release and Resource Sharing Guidelines, requires that all sequence data, along with annotations and identifications, be submitted to NCBI on a weekly basis. The decision to shutter the archive was reversed, but for a time it seemed that genomic sequence data, like nuclear waste, would have to be stored on site until a national depository could be found. With hypothesis-driven science, investigators are often able to obtain whatever data they need either in their own laboratories or through one of many commercial DNA sequencing services. Why search through unwieldy public databases when it is so inexpensive to do your own sequencing once you know why you are doing it? In 2011, for example, a deadly outbreak of enterohemorrhagic Escherichia coli in Germany sickened almost 4,000 people, about 50 of whom died. Early in the outbreak, biologists, using a commercial sequencing service, took only a few days to identify the culprit E. coli strain and trace it to its source.

Data tsunamis build when researchers sequence first and look for questions to ask later. Is it possible to identify a logical stopping place for generating genomic data? The number of different microbiota in and around human beings, for example, is for all practical purposes infinite. The deluge rises in the absence of a framework for deciding which ones to sequence and for containing, organizing, and interpreting the data that result.

The blog of one major HMP participant, the J. Craig Venter Institute, reported on early results from “700 samples from hundreds of individuals taken from up to 16 distinct body sites.” The “data produced from the sequences exceeds 10 terabytes,” which had to be stored. “Ultimately researchers want to relate this information to healthy versus disease states in humans,” the Venter Institute blog hopefully opined, although “ultimately” does not suggest when or how this might be done. In an interview with Science magazine on June 6, 2012, George Weinstock, a principal investigator for the HMP, said, “Despite the huge amount of the work that has been done on the human microbiome, the number of rigorously proved connections between disease and microbiome are few to none.”

In a famous poem, “Der Zauberlehrling” (“The Sorcerer’s Apprentice”), Goethe describes a wizard who, as he leaves his workshop, assigns his apprentice chores, including bringing water from a well. The apprentice enchants a broom to fetch water for him, but he cannot stop the broom as it endlessly repeats its task and floods the workshop. The apprentice hacks the broom in pieces, but each piece becomes a new broom and brings more water. By forcing down the cost of gene sequencing, NIH and others have created magic brooms that endlessly fetch genomic data. At BGI, a large sequencing center, Bingqiang Wang lamented, “We are drowning in the genome data that our high-throughput sequencing machines create every day.”

The Human Microbiome Project

The HMP illustrates the problem that data deluge poses when a context or framework for interpreting those data is lacking. According to Lita Proctor, the working group coordinator, the HMP was “specifically devised and implemented to create a set of data, reagents, or other material whose primary utility will be as a resource for the broad scientific community.” Officially launched by NIH in October 2007 as a $157 million, five-year effort, the project seeks to “characterize the microbial communities found at several different sites on the human body, including nasal passages, oral cavities, skin, gastrointestinal tract, and urogenital tract.” The microbiota in and on a healthy human are thought to contain 10 times as many cells as the human body itself and to include bacteria, viruses, archaea, protozoans, and fungi. According to Proctor, the HMP will survey “a cohort of healthy adults to produce a reference dataset of baseline microbiomes.” But as microbiologist George Weinstock, associate director of the Genome Institute at Washington University in St. Louis, pointed out in a presentation, “Probably [there is] not a ‘reference’ microbiome.” The project will collect “sequences of reference strains,” although the idea or purpose of a “reference strain” is not defined.

There are genomics projects that are complex and challenging but still rely on well-understood rules of induction to try to answer a specific question, such as which genetic mutations cause Mendelian (monogenic) disease. A more ambitious effort, the genome-wide association study (GWAS), involves scanning the genomes of many people to find variations associated with phenotypic differences while controlling for environmental factors. Phenotypic traits may resist definition (“asthma,” for example, could refer to a great number of problems), and environmental factors (“lifestyle,” for example) may confound the analysis. Even if it is hard to control for genotypic, phenotypic, and environmental variance, however, at least one has some idea of what these concepts mean in GWAS research.

With the HMP it is different. Distinctions between genotype and phenotype, or between genomic and environmental factors, are impossible to understand conceptually, much less control experimentally. It is not clear, for example, whether the microbiome should be merged with the human genome or considered part of its environment. Are the phenotypic traits of the associated bacteria human traits or not? Are the genomes of the microbiota part of the human genome, whereas the organisms themselves are part of its environment? Questions such as these are so imponderable, so up for grabs, that they are not worth asking. In evolutionary biology, the concept of a reference organism is entirely clear; it provides the model to which developmental and other phenomena in similar organisms are compared. The HMP speaks in terms of a “reference set” of perhaps 3,000 genetic sequences, but it is not clear what “reference” in this context may mean.

To create a “reference set of microbial gene sequences,” the HMP began with a “jumpstart” phase that funded four large-scale sequencing centers. The announcement stated, “This initiative will begin with the sequencing of up to 600 genomes from both cultured and uncultured bacteria, plus several non-bacterial microbes.” As costs fell, the number of reference microbial genomes rose; the HMP Working Group stated that the project “will add at least 900 additional reference bacterial genome sequences to the public database.” In a more recent document, Proctor refers to a “target catalog of 3,000 microbial genome sequences.”

The HMP jumpstart phase has made rapid progress not only in reaching its target of 600, 1,000, or 3,000 reference gene sequences but also, to quote the initial announcement, in continuing “with metagenomic analysis to characterize the complexity of microbial communities at individual body sites.” Metagenomic analysis involves sequencing mixed fragments of DNA or RNA detected in samples of organic material. In one European study, researchers produced 576.7 gigabases of sequence from DNA detected in stool samples. The research team discovered 3.3 million nonredundant microbial genes, primarily bacterial, associated with what they dubbed “the fecal microbial community.” That is roughly 150 times the number of genes identified in the human genome proper. Opportunities for further sequencing abound. According to another study, “For every bacterium in our body, there’s probably 100 phages, with an estimated 10 billion of these viruses packed into each gram of human stool.”

To get some conceptual purchase on the HMP, its Working Group asked “whether there is a core microbiome at each body site.” There has been a lively debate over whether this is a meaningful question. A group of geneticists reported in Genome Biology in 2011 that “there is pronounced variability in an individual’s microbiota across months, weeks, and even days. Additionally, only a small fraction of the total taxa found within a single body site appear to be present across all time points, suggesting that no core temporal microbiome exists.” It appears safe to say that bacteria of the broad phyla Bacteroidetes and Firmicutes are found in every individual. Beyond this, according to Yale microbiologists Ashley Shade and Jo Handelsman, “what constitutes a core remains elusive.” Microbiologist Julian Marchesi opines that “when we drill down the taxonomic levels it seems that this concept becomes more sketchy and different studies and methods provide different answers.”

According to a study published in 2005, “the bacterial communities in the human gut vary tremendously from one individual to the next.” An article published in Nature in 2010 lists several studies that have shown “substantial diversity of the gut microbiome between healthy individuals.” Even identical twins present an amazing diversity in the microbiota that accompany them. A metagenomic analysis of fecal microbiota “revealed an estimated 800–900 bacterial species in each co-twin, less than half of which were shared by both individuals.”

The HMP announcement states, “Initially, 16S rRNA gene sequencing will be used to identify the microbiome community structure at each site.” This gene encodes part of the cellular machinery responsible for synthesizing proteins in bacteria and archaea and is well conserved across those lineages. So far, HMP “has produced a 2.3-terabyte 16S ribosomal RNA metagenomic data set of over 35 billion reads taken from 690 samples from 300 U.S. subjects, across 15 body sites,” according to a recent report in Nature Reviews Genetics. That a microbiome community (for example, the “fecal microbial community”) can be delimited and defined by this large-scale sequencing project cannot be assumed. Even if there is a community structure, it is not clear that 16S rRNA sequencing will identify it. In the genome of any given microbe, a dozen or more copies of the 16S gene may be found, with significant nucleotide differences among them. Microbes that differ genetically in dramatic ways, moreover, can contain 16S genes that are identical or nearly the same.

“The genus Bacillus is a good example of this,” microbiologists J. Michael Janda and Sharon L. Abbott explain. The 16S rRNA genes associated with strains of B. globisporus and B. psychrophilus are more than 99.5% the same, but at the genomic level these strains show little relatedness. Other researchers comment that because of the limited nucleotide variability in the 16S gene in bacteria, “taxonomic assignment of species present in a mixed microbial sample remains a computational challenge.”
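The arithmetic behind that 99.5% figure shows how little room the 16S gene leaves for discrimination. The sketch below assumes a full-length 16S gene of about 1,500 base pairs (a conventional round figure, introduced here for illustration) and simply computes identity for a few mismatch counts:

```python
# Simple illustration of how few nucleotide differences a ">99.5% identical"
# 16S comparison allows. The 1,500 bp gene length is an assumed round figure.

def percent_identity(length_bp: int, mismatches: int) -> float:
    """Percent identity of two aligned sequences of equal length."""
    return 100.0 * (length_bp - mismatches) / length_bp

GENE_LENGTH_BP = 1_500  # approximate length of a full 16S rRNA gene (assumed)

for mismatches in (0, 3, 7, 15):
    identity = percent_identity(GENE_LENGTH_BP, mismatches)
    print(f"{mismatches:2d} mismatches over {GENE_LENGTH_BP} bp -> {identity:.2f}% identity")
```

On this arithmetic, two strains can differ at only a handful of 16S positions, and so sort into the same taxonomic bin, while their genomes diverge substantially; that is the computational challenge those researchers describe.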

The assignment of species, however, may be incidental to the problem of understanding what is meant by concepts such as “community” and “structure” when applied to a fecal sample or to any collection of organic matter that may be subject to metagenomic analysis. More challenging may be the problem of determining how many discrete microbiomes occupy a bodily site. The mouth, for example, harbors any number of unique habitats, each with its own distinct populations. There does not seem to be a way to count how many microbiomes are there.

The HMP in its initial phase has succeeded in producing very large data sets, but it has yet to provide a conceptual framework beyond tagging sequences to the general locations where they are found. According to University of Michigan epidemiologist Betsy Foxman and coauthors, “As the HMP moves forward, it would benefit from the development of an overall conceptual framework for structuring the research agenda, analyzing the resulting data, and applying the results in order to improve human health.” Lita Proctor has described the same challenge. “Given that variation in the microbiome appears to be far greater than human genetic variability, repeated studies in each target population will be needed to identify keystone microbiome signatures against a complex and contextually dependent background.” The project, however, has yet to suggest meanings for concepts such as “keystone microbiome signatures” and to agree on ways to identify and re-identify microbiomes, if these exist, as entities through time and change.

Are we ecosystems?

In 2007, a team of microbiologists introduced the HMP in an inaugural article published in Nature magazine. “If humans are thought of as a composite of microbial and human cells,” they wrote, “then the picture that emerges is one of a human ‘supraorganism.’” A companion paper similarly presented the HMP in ecological terms. “Humans and their collective microbiota are segmented into many local communities, each comprising an individual human,” it stated. “This ecological pattern, characterized by strong interactions within distinct local communities and limited interactions or migration between them, is described as a metacommunity.” In her 2011 report on the progress of the HMP, Proctor likewise refers to the “human superorganism.”

The leaders of the HMP almost universally appeal to ecological concepts and metaphors to provide a conceptual framework for their research. The inaugural HMP article in Nature declared, “Questions about the human microbiome are new only in terms of the system to which they apply. Similar questions have inspired and confounded ecologists working on macro-scale ecosystems for decades.” The article continues, “It is expected that the HMP will uncover whether the principles of ecology, gleaned from studies of the macroscopic world, apply to the microscopic world that humans harbor.” Likewise, in a 2011 manifesto titled “Our Microbial Selves: What Ecology Can Teach Us,” a group of microbiologists working with the HMP proposed to “answer fundamental questions that were previously inaccessible” by using “well-tested ecological theories to gain insight into changes in the microbiome.”

The absence of a conceptual framework for interpreting HMP data becomes apparent when one asks which principles and well-tested theories in ecology can provide insight into the human microbiome. Ecologists do not have a settled idea of the “function” of an ecological community or system. They may point out that some apparently “functional” ecosystems, such as salt marshes, are monocultures. The microbiologists state, “Community ecologists are interested in what controls patterns in diversity and the dynamics of consortia in the same environment.” Ecologists have never found consensus that such patterns exist, however, nor identified any forces that control them. Environmental historian Donald Worster has written, “Nature should be regarded as a landscape of patches, big and little, patches of all textures and colors, a patchwork quilt of living things, changing continually through time and space, responding to an unceasing barrage of perturbations. The stitches in that quilt never hold for long.”

According to bioethicist Eric Juengst, HMP scientists believe “that the human body should be understood as an ecosystem with multiple ecological niches and habitats” and that “human beings should be understood as ‘superorganisms’ that incorporate multiple symbiotic cell species into a single individual with very blurry boundaries.” The architects of the HMP, Juengst has written, “describe the individual human body as itself an ecosystem.” Researchers “almost universally declare human beings to be ‘superorganisms’ rather than discrete biological individuals, rendering our personal boundaries fluid and flexible.” This fundamentally changes how the patient in a medical context is portrayed, not as an individual but as an ecosystem.

How well does the ecological analogy work in medicine? Microbiologists at the University of Colorado involved in microbiome research have opined, “Diversity might also have a crucial role in ecosystem health by contributing to stability.” Has microbial diversity a role in human health? Are more microbes of more kinds better for you? The stability-diversity hypothesis and the idea of “ecosystem health” have been so roundly criticized by ecologists that they have largely abandoned these concepts. According to three prominent ecologists, Volker Grimm, Eric Schmidt, and Christian Wissel, “The term ‘stability’ has no practical meaning in ecology.” Indeed it cannot have any meaning because ecology lacks identity conditions for ecosystems—that is, criteria by which to determine when a site remains the same or becomes a different ecosystem though time and change. This suggests a difference between humans and ecosystems. A patient may die but the superorganism or metacommunity lives on, as would any ecosystem, even if perturbed. Death may increase biotic diversity and therefore ecosystem health in the microbiome of the cadaver. Change happens. It’s all good. As Juengst points out, “there are no bad guys in ecosystems.”

The HMP has a predecessor in ecology. As a commentary in Science points out, “For a 7-year period ending in 1974, the United States participated in the International Biological Program (IBP)—an ambitious effort that was supposed to revolutionize … ecology and usher in a new age of ‘Big Biology.’” The IBP, which received about $60 million from NSF and more funding from international agencies, attempted to survey or census the biota in “six biomes: the tundra, coniferous and deciduous forests, grassland, desert, and tropical vegetation,” according to botanist Paul Risser in 1970. “The data from all the sites is sent to Colorado State University where initial analysis and summary takes place,” Risser added. “Eventually we will translate these … statements into computer languages to permit simulation and optimization analysis.”

The purpose of the IBP was not to test a hypothesis or answer a specific question but to provide a resource for the scientific community. It intended to “determine the biological basis of productivity and human welfare.” This proved difficult because the 1,800 U.S. scientists who engaged in IBP research discovered in each “biome” a collection of thousands of patches, each with a different and transient assortment of species. The concept of a biome failed to provide a conceptual framework for organizing the plentiful data the IBP produced. There were no non-arbitrary bounds or biota to define a biome; sites changed from season to season and day to day. The data sets may still languish at Colorado where they were sent, or they may have disappeared.

Ethical questions

A salient and pressing ethical, legal, and social issue confronting the HMP is whether to maintain the data sets it produces in the absence of a conceptual framework for interpreting them. Two biologists at NCBI have written, “An interesting, perhaps provocative question is whether a sufficient number of genomes have already been sequenced.” They speculate “that microbial genomics has already reached the stage of diminishing returns, such that each new genome yields information of progressively decreasing utility.” In the absence of a conceptual framework other than vague ecological metaphors such as “superorganism,” one may ask if society has an ethical duty to keep the tsunami of data the HMP produces.

Ewan Birney, a bioinformatician at the European Bioinformatics Institute, has said that because the cost of generating data falls much faster than the cost of storing it, “there will come a point when we will have to spend an exponential amount on data storage.” A recent estimate pegs the price for “cloud” storage at 14 cents per gigabyte per month. The National Library of Medicine (NLM) 2012 budget requested $116 million to support NCBI; funding is “specifically added … to meet the challenge of collecting, organizing, analyzing, and disseminating the deluge of data emanating from NIH-funded high-throughput genomic sequencing initiatives.” Should society pay to store and manage HMP data until a conceptual framework can be found for interpreting them?
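Taking the figures quoted in this essay at face value, a rough sketch of the monthly storage bill looks like this (data-set sizes are those mentioned above; the conversion assumes 1 Tb = 1,000 Gb for simplicity):

```python
# Rough storage bill implied by the quoted cloud price of $0.14/GB/month,
# applied to data-set sizes mentioned elsewhere in this essay. Illustrative only.

PRICE_PER_GB_MONTH = 0.14  # dollars per gigabyte per month (quoted estimate)

DATA_SETS_TB = {
    "1000 Genomes Project (~130 Tb)": 130,
    "HMP metagenomic sequence at the Venter Institute (>10 Tb)": 10,
    "HMP 16S rRNA data set (~2.3 Tb)": 2.3,
}

for name, size_tb in DATA_SETS_TB.items():
    monthly = size_tb * 1_000 * PRICE_PER_GB_MONTH  # terabytes -> gigabytes, then price
    print(f"{name}: ~${monthly:,.0f} per month, ~${monthly * 12:,.0f} per year")
```

On these assumptions, keeping the 1000 Genomes data alone in commercial cloud storage would cost on the order of $200,000 a year, a small but open-ended commitment, and that is only one of many such data sets growing daily.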

Are data entitled to be preserved? Science writer Matthew Dublin has proposed that the data tsunami challenges scientists who believe data are a sacred responsibility. The public, especially when budgets are tight, may grow weary of the cost. Scientists who test a hypothesis or answer a question they care about have an incentive to keep their data. This may not be as true of data created primarily as a resource for the broad scientific community.

One may identify two extreme positions as bookends between which to locate an ethically defensible response in the context of the HMP to the data deluge or tsunami problem. On the one hand, NIH could write off the HMP as a sunk cost and leave it to the scientific community to decide which data it wants to keep as a resource. In other words, one could say that NCBI had it right when it threatened to close. To find the needle in the haystack, one does not add more hay. It is worse: Needles turn into hay when looked at differently, and hay turns into needles. Instead of sequencing first and looking for questions later, NIH may do better, in the words of Green and Guyer, to support “individual investigators to pursue more effective hypothesis-driven research.” Let the hypothesis decide what data it needs. There is a lot of sequencing capacity out there and a practically infinite number of microbes to sequence. If the point is just to keep the sequencers busy, then the HMP represents mission creep in the Human Genome Project.

On the other hand, NIH could double down on its investment by sequencing more and more microbes and metagenomes in the hope that large enough data sets will speak for themselves and yield insights in response to the principles of ecology and other algorithms. This approach calls not only for a conceptual framework that does not yet exist, but for a philosophy of science that abandons hypothesis-driven research. According to a group of biologists writing in BioScience, data-intensive science does not test theories, models, or hypotheses but “requires new synthetic analysis techniques to explore and identify … truly novel and surprising patterns that are ‘born from the data.’” An HMP website agrees: “The data sets produced by metagenomic sequencing and related components will be very large and complex, requiring novel analytical tools for distilling useful information from vast amounts of sequence data, functional genomic data and subject metadata.”

One may question the view of its founding advocates that the “HMP is a logical, conceptual, and experimental extension of the Human Genome Project.” Spatial contiguity, often transitory, as when your dog licks your hand, relates a human genome to microbiota, but this is not a logical, conceptual, or experimental extension. Individuals vary genetically in ways that show no correlation with the ways in which their microbiota vary. A group of biologists did observe, however, that the HMP “is following in the footsteps of the Human Genome Project … [in its] potential disappointment and resentment over the lack of medical applications.”

At the other extreme, one may respond that society has an obligation, to quote the NLM again, “to meet the challenge of collecting, organizing, analyzing, and disseminating the deluge of data emanating from NIH-funded high-throughput genomic sequencing initiatives.” The HMP Working Group has written, “Computational methods to process and analyze such data are in their infancy, and, in particular, objective measures and benchmarks of their effectiveness have been lacking.” The massive investment in producing data sets, according to this view, is not a sunk cost but a justification for more investment in computational methods, since without them the value of the data, which was produced in anticipation of these algorithms, will be lost.

The HMP may find itself in the position of the miller in the famous German fairy tale who, to gain influence with the king, said his daughter could spin straw into gold. The king provided a spinning wheel and plenty of straw. Fortunately, the girl in the fairy tale overheard an imp who while dancing chanted a novel, integrative, synthetic, computational algorithm for spinning straw into gold. When an analogous informatics becomes available for the HMP, we may call it the Rumpelstiltskin algorithm.

A possible way to hasten this transition might be to privatize as a nonprofit outfit the Data Analysis and Coordination Center (DACC), which is the central repository for all HMP data. The DACC would then charge for the use of data a fee that represents some part of the cost of storing and making it accessible. Researchers could then decide whether it is cheaper to do their own sequencing and annotate it in their own way or to download the data. If algorithms appear that turn Big Data into Big Biology, the DACC will support itself by the fees it charges. If interest in the data set is too little to meet the costs of maintaining it, however, one may wonder where to put it.

According to Nature Medicine, “Each day, approximately 10 terabytes of data stream out of more than 90 gene sequencers at The Broad Institute” at Harvard and MIT, one of four major sequencing centers funded by NIH to jumpstart the HMP. In a podcast associated with a talk she gave in September 2011, Toby Bloom, director of informatics at the Broad Genome Sequencing Platform, commented, “as the data get older and the technology gets older, … our data isn’t just two years old, it’s four years old or six years old. And what do we do with that older data? What of it do we have to keep, and what do we do about the costs?” Bloom considered the storage issues tractable. “Dealing with the size of the data is no longer the thing that keeps me up at night … What I want to address is, where do we want to go with all this data?”

Recommended Reading

  • Andrew Pollack, “DNA Sequencing Caught in Deluge of Data,” New York Times, Dec. 1, 2011: B1.
  • L. Dethlefsen et al., “An Ecological and Evolutionary Perspective on Human–Microbe Mutualism and Disease,” Nature 449 (2007): 811–818.
  • Michael Y. Galperin and Eugene V. Koonin, “From Complete Genome Sequence to ‘Complete’ Understanding?” Trends in Biotechnology 28, no. 8 (August 2010): 398–406.
  • A. Gonzalez et al., “Our Microbial Selves: What Ecology Can Teach Us?” EMBO Journal 12 (2011): 1–10.
  • Eric D. Green and Mark S. Guyer, “Charting a Course for Genomic Medicine from Base Pairs to Bedside,” Nature 470 (Feb. 10, 2011): 204–213.
  • M. Hattori and T. D. Taylor, “The Human Intestinal Microbiome: A New Frontier in Human Biology,” DNA Research 16, no. 1 (2009):1-12.
  • E. Juengst, “Metagenomic Metaphors: New Images of the Human from ‘Translational’ Genomic Research,” in M. Drenthen et al., eds., New Visions of Nature (Springer, 2009): 129–146.
  • Matthew Dublin, “Storage Saga,” Genome Technology.
  • E. Pennisi, “Will Computers Crash Genomics?” Science 331 (2011): 666–668.
  • L. M. Proctor, “The Human Microbiome Project in 2011 and Beyond,” Cell Host and Microbe 10 (2011): 287–291.
  • A. Shade and J. Handelsman, “Beyond the Venn Diagram: The Hunt for a Core Microbiome,” Environmental Microbiology 14 (2012): 4–12; doi: 10.1111/j.1462-2920.2011.02585.x.
  • P. J. Turnbaugh, R. E. Ley, M. Hamady, C. M. Fraser-Liggett, R. Knight, and J. I. Gordon, “The Human Microbiome Project,” Nature 449 (2007): 804.
