Building a Data Infrastructure for the Bioeconomy
A new generation of interoperable data systems can secure public health and guide public policy in a future of rapid environmental change, sophisticated biological threats, and an economy enabled by biotechnology.
While the development of vaccines for COVID-19 has been widely lauded, other successful components of the national response to the pandemic have not received as much attention. The National COVID Cohort Collaborative (N3C), for example, flew under the public’s radar, even though it aggregated crucial US public health data about the new disease through cross-institutional collaborations among government, private, and nonprofit health and research organizations. These data, which were made available to researchers via cutting-edge software tools, have helped in myriad ways: they led to identification of the clinical characteristics of acute COVID-19 for risk prediction, assisted in providing clinical care for immunocompromised adults, revealed how COVID infection affects children, and documented that vaccines appear to reduce the risk of developing long COVID.
N3C has created the largest national, publicly available patient-level dataset in US history. Through a unique public-private partnership, over 300 participating organizations quickly overcame privacy concerns and data silos to include 13 million patient records in the project. More than 3,000 participating scientists are now working to overcome a challenge particular to the United States, which lacks the kind of national healthcare data infrastructure available in many other countries to support public health and medical responses. N3C shows great promise for answering questions related to COVID, but it could also easily be expanded to many other areas of public health, including pandemic preparedness and monitoring disease status across the population.
As public servants dedicated to improving public health and equity, we believe that to unite the nation’s fragmented public health system, the United States should establish a standing capacity to collect, harmonize, and sustain a wide range of data types and sources. The public health data collected by N3C would ultimately be but one component of a rich landscape of interoperable data systems that can guide public policy in an era of rapid environmental change, sophisticated biological threats, and an economy enabled by biotechnology. Such an effort will require new thinking about data collection, infrastructure, and regulation, but its benefits could be enormous, enabling policymakers to make better-informed decisions in an increasingly complex world. And as the interconnections between society, industry, and government continue to intensify, decisionmaking of all types and scales will be more efficient and responsive if it can rely on significantly expanded data collection and analysis capabilities.
Some of the challenges that must be overcome are already obvious, such as the need to collect and analyze more public health data. A pathogen surveillance system, for example, will require integrating electronic health record data from hospitals along with county vaccine data, school tracking, and wastewater sources. Vaccination data must be gathered from hospitals, pharmacies, registries, and mass-vaccination sites; this capability would allow for more effective assessment of vaccination’s true health impacts.
But an even greater set of challenges and opportunities looms in the current revolution in the life sciences. The next-generation bioeconomy has the potential to transform health, agriculture, industry, and the environment. Gene editing technology, for example, may someday eradicate entire classes of human and animal diseases and also safeguard crops from climate change. Biomanufacturing may enable the creation of radical new biomaterials to tackle important environmental and societal problems. And bioengineered plastics could enable integrated plastic waste management, preventing pollution and reducing oil consumption.
However, the transformative potential of the bioeconomy will also go hand in hand with new categories of risks that can only be understood and addressed through the collection and integration of new categories of data. For instance, rigorous data collection is the only way to enable physicians to determine whether new gene therapies, which have longer lasting impacts than many medications, benefit patients over the long term. More data are also necessary to understand the way that genetically engineered crops, which could help protect the food supply from rising temperatures, might also trigger subtle shifts in human microbiomes that impact health. And whether novel biomaterials beyond plastics could have unforeseen consequences for humans, animals, and the environment is poorly understood.
Armed with sufficient data, policymakers and others will be able to consider more fully the impact of these new technologies on human health, ecosystem dynamics, and the whole environment. These data could also be used to evaluate complex risks and determine responses. To gain this capability, however, the process of building the necessary data infrastructure must begin today.
To start, policymakers, researchers, and industry professionals must recognize that a data infrastructure for the future will necessarily be multisector, and that multiple stakeholders must be included from the inception of the project. One of the lessons of N3C is that innovative communication and technical coordination among multisector participants across institutions, government agencies, and commercial partners are essential to creating shared data infrastructure and making the most of it.
In addition, new policies must take a portfolio approach to the development of software systems, creating competitive market dynamics that allow the best data management products to succeed. One important lesson from the HITECH Act, the 2009 legislation intended to expand health information technology, is the danger of optimizing for a single compliance metric. In the healthcare information technology that emerged under the legislation, a systemic focus on checking boxes rather than investing in next-generation technologies predictably failed to produce the capabilities needed for current and future analyses of health and disease and for early warning.
Rather than seeding the creation of a dynamic ecosystem of upstarts, the emphasis on compliance privileged existing vendors with experience navigating the healthcare ecosystem. Moreover, because the purchasers of electronic health record systems (healthcare executives) are not the same as the users (patients, providers, and researchers), the resulting ecosystem lacked the typical competitive dynamics in which user behavior and preferences drive the development of steadily better products and, in turn, a better care-delivery environment.
Furthermore, coordination of cross-sector and cross-agency infrastructure has not been the norm, and the development of systems has been neither transparent nor designed for interoperability. Learning the lessons from past mistakes in medicine and public health, key stakeholders in the federal government, academia, industry, and science philanthropy must work together to create a software ecosystem with many players, where healthy competition drives the development of necessary tools and infrastructure.
Managing this infrastructure will require teams and organizations that are different from traditional research groups in public health, epidemiology, climate change, agriculture, or environmental health. These groups will produce infrastructure, tools, resources, and policies that will likely have greater impact on society than is captured by traditional academic measures such as publication metrics, and these outcomes will need to be incentivized. As demonstrated by N3C, coordinating all these activities with excellence in operational management, communications, and technical documentation is critical to efficient and effective infrastructure. But in today’s world, these skills are often grossly undervalued.
Finally, data governance capable of supporting secure, accessible, and interoperable research across sectors will require policy innovation and new laws. This governance will include norms that enable the fusion of academic research labs with industrial operations, blending cultural and operational elements from both. It must be designed to protect individuals and communities from harm, while addressing the current power imbalance: people share a plethora of data about themselves, but then don’t have access to some of the most important findings from such data that could help them. Indeed, current policies designed to protect the privacy of individuals’ healthcare data end up incentivizing organizations to sell those data, even when the sale of anonymized medical records is not in the patients’ interest. It doesn’t have to be that way: models and technological solutions for security and data governance that support privacy and citizen sovereignty already exist. Understanding, governing, and communicating the tension between protecting individual privacy and using data about the population for public good is crucial to building an equitable data infrastructure.
There are historical precedents for this unique data infrastructure. In the early years of the atomic age (i.e., the late 1940s), the government established the Advisory Committee on Biology and Medicine and a Division of Biology and Medicine. These programs were designed to monitor the health of atomic workers, but they grew to study genetics and eventually became the precursor to the Human Genome Project. While citizens rightly have concerns regarding the government collecting information about them, they have also seen such data collection efforts provide huge benefits to themselves and to their communities. This has been particularly salient in the current pandemic, where individuals can access information about where the virus is spreading based on their own reporting, phone apps, and voluntary testing programs.
Shared, well-governed data infrastructure could provide huge social and economic benefits that help overcome inequities in access to resources as diverse as healthcare, food, education, and agricultural products. But doing this poorly will have drastic consequences, exacerbating inequities and leaving the nation subject to greater threats of all kinds, including disease, biowarfare, cyberattacks, and food insecurity. Moreover, by failing to establish an effective data infrastructure, the United States will fall behind other nations and be left even more vulnerable.
American society may soon be transformed by radical capabilities to engineer biology. Collective health and security require that government, private, and nonprofit organizations proactively build the capacity to anticipate and prepare for adverse events at a large scale and to enhance decisionmaking at multiple levels. Novel biomanufacturing capabilities will require data sharing and coordination between private industry, the nonprofit sector, and the government to assess threats to public and ecological health and to weigh sustainable, protective response options.
Society cannot wait until the next biological threat arises, whether it is a pathogen, biomaterial, or engineered food source. For the nation’s safety and wellbeing, health and research organizations must build the data infrastructure for a future US bioeconomy today.