A National Academy of Reliable Infrastructure Management
Control room managers are vital to the functioning of critical infrastructure. A new academy could help advance and share their expertise to train the next generation of infrastructure professionals.
Two of 2021’s biggest catastrophes, the February collapse of the Texas electric grid and the ransomware attack and shutdown of Colonial Pipeline in May, show how important control room managers are to the functioning of the infrastructures that support the US economy and society. Yet this unique type of management is missing from the bipartisan investment plan for new US infrastructure construction and renovations. Nor is real-time management central to other initiatives such as the National Infrastructure Bank, which was proposed in 2007 and is now resurfacing in the policy mix through 2020 legislation.
Real-time infrastructure managers are the invisible glue that fastens together modern society, the professionals who sit in the center of information and systems, adjusting them to keep things moving. They are repositories of the enormous amounts of expertise it takes to operate large, dynamic critical systems ranging from energy grids, urban water supplies, and flood protection systems, to telecommunications, vessel transportation, and aviation networks. And as new systems are patched onto older ones, the need for these real-time professionals to keep the systems reliable and safe increases. Their mandate is high-reliability management—ensuring the safe and continuous provision of what society considers to be core services, especially during turbulent times.
As the United States enters a period of shoring up old infrastructures and building out sophisticated new ones, we need to bank and enhance the knowledge of real-time system managers, particularly those who work in centralized control rooms of critical infrastructures. To achieve this, I propose a National Academy of Reliable Infrastructure Management. The new academy would help remediate the nation’s infrastructure crisis by enhancing and advancing high-reliability management, both by supporting the managers themselves and by encouraging the study of this management. The academy would be unique in building an understanding of this specialty, which thrives in a space between and often beyond the domains of engineering, systems modeling, and economics.
Why an academy for infrastructure management? As demonstrated every day, large critical infrastructures must be managed beyond what their technology, formal designs, and published regulations anticipate. The academy’s challenge would be to ensure that the tasks and demands of rapidly changing infrastructure technologies are matched to the people with the skills and expertise to manage them. The only way to do that is to bring together this under-recognized class of experts from the United States and around the world, allowing infrastructure control room operators and managers (who often have long experience and variable formal education), along with their immediate expert support staff (who are more likely to have higher formal degrees), to share and advance their expertise inside and outside their own control centers.
Through projects, studies, and other advisory and convening activities, the academy would consolidate a community of practice among these experts, while pursuing a mission to assess, assemble, and advance evidence-based findings for real-time reliability and safety management of infrastructures under twenty-first century conditions. In doing so, the academy would have the prestige to facilitate research access to major infrastructure control centers whose entry is currently restricted for proprietary or security reasons.
During this time of rapid technological and environmental change, the academy could foster the management skills necessary to navigate the interdependencies and interconnections of critical infrastructure sectors. This would require a focus on critical national services such as water, electricity and natural gas, hazardous liquids transmission, and aviation, while ensuring their reliable and safe interconnectivity. Today, hardly any segment of US infrastructures is physically disconnected from other sectors. For example, natural gas is used to provide electricity, which supplies the water needed by the refineries that process the hazardous liquids, including Jet A-1 fuel for aviation.
Real-time system interconnectivity
Although infrastructure interconnectivity is essential to the continuous functioning of society, it has rarely been analyzed and improved as it happens in real time. Consider what happens when, say, an explosion occurs at a major natural gas reservoir. Immediate staff and the regulator of record begin a root-cause analysis, a process of zooming down to determine what precipitated the explosion. But no one is officially tasked with understanding the knock-on effects of the crisis on the interconnected infrastructure, a process that would include zooming out to see how the ripple effects moved through the connected systems.
Identifying the cause of the explosion is obviously important to prevent further ones from happening at this and other reservoirs, but that does not go far enough in making sure that connected systems are managed reliably and safely. Instead, a thorough analysis of control room behavior and its options should include a review of the accident within the larger ecosystem of interconnectivity. What happened during the explosion to infrastructures that depend on natural gas for their own operations, and what next steps did they take? To my knowledge, the regulators of record do not work together to answer that question, routinely or as a matter of priority.
Such questions would be of core concern to the new academy. Were the control centers able to compensate for the loss of a reservoir in real time? Did they keep the crisis from spreading to other parts of their transmission and distribution systems, including the variety of end-use customers? How did they stumble, and what other parts of their systems were vulnerable?
These systemwide and intersystem assessments, as well as the knowledge they unlock, will be necessary to keep future infrastructures connected under complex and dynamic conditions. In particular, the assessments will build an established body of evidence about how changes to high-reliability management, often instituted following a disaster, will affect the functioning of critical infrastructures at the system and intersystem levels. To stay with our example, will new regulatory requirements, when implemented, undermine the previously proven capacity of the infrastructure’s control room to prevent disruptions from cascading across the natural gas system or beyond? No regulator of record is tasked to answer that question or similar ones about cascade potential. Providing such answers would be a primary goal of the new National Academy of Reliable Infrastructure Management.
Elevating a critical field
Analysis and planning for infrastructure reliability are generally led by engineers, economists, and system modelers, while the perspectives of real-time control room operators are not often a priority. This springs, in part, from narrowed professionalism. I’ve often heard comments to the effect that “Control room operators aren’t really experts, like the engineers and economists with whom they work,” and “Control rooms aren’t really innovative; in fact, they’re the opposite.” These comments reveal a deeper tension: operators are entirely focused on preventing systemwide failure, so the discipline often finds itself in a position orthogonal to those professions that insist innovation is impossible if you’re not prepared to fail.
Another disciplinary difference is that engineers generally view a system as either in normal operations or otherwise failing. But it is during the state of temporary service disruption that operators demonstrate their skills and ability to restore service. This asymmetric focus may account for some of the cultural differences that have plagued relations between engineers and control room operators and, more recently, between operations and information technology professionals.
Sometimes the cultural differences are expressed pejoratively, as when one engineer I interviewed called control room operators “Neanderthals.” There are of course exceptions, but control room operators are generally seen by engineers as barriers to the advanced running of infrastructures—witness the familiar engineering calls for new hardware and software to “correct for operator error.” In contrast, there are no equally familiar accounts of how many disasters control rooms regularly avert, and what this proven value is to society.
A freestanding academy could begin to constructively address and evaluate these differences in expert orientations with an eye toward keeping changing infrastructures running smoothly, making the most of various disciplinary perspectives.
Fostering real-time innovation
The primary task of the academy would be to establish a body of knowledge about how to advance high-reliability management in real time for increasingly complex infrastructure across unpredictable futures. There are two areas where such study could provide significant returns, both to society and in terms of financial costs.
The first is to focus on the underappreciated precursors of service disruption and infrastructure failure to establish early warning signals to avert disaster. Here, control room operators, who are constantly thinking about complicated “what if” scenarios, have a distinctly more nuanced view than systems engineers and modelers, who tend to think in binary terms of system success or failure. System modelers often seek worst-case scenarios where the entire system fails in sequence or all at once.
By contrast, real-time operators are not in a position to ignore the fact that brownouts often precede blackouts, some levees are seen to seep long before failing, and the electric grid’s warning indicators of disruption or failure typically increase beforehand. While facing many thousands of daily cyberattacks, infrastructure professionals have to be skilled in both systemwide pattern recognition and localized scenario formulation. The real-time indicators that control room operators rely on for monitoring the overlapping sets of precursors are, in my view, too rarely recognized by the regulators of record or in system models of interconnectivity.
The academy’s task would be to advance such leading indicators of systemwide disruption or collapse with the goal of preventing cascades before they occur. This is particularly important at a time when critical infrastructures are operating at, or beyond, their performance edges, and control operators are increasingly pushed to work outside upper and lower bounds of reliable and safe performance. It’s also important because technical systems are becoming more opaque, making systemwide patterns harder to recognize. All of these converging and stacking changes mean that what-if scenarios cannot be formulated with the same level of granularity as in the past. Establishing an academy would help sustain the nation’s attention on measuring and monitoring these and other tipping points and transitions.
A second important role for the academy would be to understand how real-time reliability professionals in the control rooms and their immediate support staff innovate on the fly, and to measure how much money this saves for society. In fact, the skill, expertise, and team-situation awareness of operators often compensate for incomplete or otherwise defective technology, design, and regulation. Thus, the culture of an infrastructure control center is a rich source of insight on how to establish and sustain steadiness and quick reflexes amid complex technological constraints. The academy could investigate, elevate, and disseminate these better organizational and management practices, skills, and core competencies.
Another task would be to explore how control room changes and innovations fare against the “reliability matters” test: do the innovations in design, technology, and software work to give real-time professionals more options and flexibility, or reduce them? Practically (not just ideally), innovations must increase the real-time ability to maneuver when responding to different, often unpredictable or uncontrollable, performance conditions. Among the many control room operators I have interviewed, I never met one who was against any innovation that increased their options, reduced task volatility, or increased maneuverability across changing performance conditions. I have, however, met economists, engineers, and consultants who ignore this reliability-matters test, just as they dismiss workarounds, treating both as proof of a control room’s “resistance to change.”
The academy will not be able to stop the premature introduction of novel software and hardware into systemwide operations, but it can monitor their real-time management impacts, forced errors, and interconnected knock-on effects. It can also work to resolve the ways disciplinary tensions become encoded in software and systems to the detriment of both operators and society.
Finally, control room operators should be considered innovators in the way they assess system risk. Virtually every discussion of new designs and technologies includes talk of trade-offs, but for control room professionals—as for the infrastructure-reliant public—high reliability in real time is not something to be traded. That is, high reliability is an order of magnitude more important than cost or efficiency when the safe and continuous provision of the critical service matters. No number of economists, engineers, and system modelers insisting that system reliability is actually a probability estimate of meeting a standard will change the real-time mandate that systemic disasters must be prevented from ever happening.
This gives operators a unique perspective in debates about cost and engineering. For them, as with much of the public, system safety is a real thing: a nuclear reactor must not blow up, urban water supplies must not be contaminated by cryptosporidium (or worse), electricity grids must not fail, jumbo jets must not drop from the sky, large dams must not breach or overtop, and autonomous underwater vessels must not jeopardize the oil rigs they are repairing. That disasters can and do happen reinforces the shared dread and commitment of the public and control operators to this precluded-event standard.
By understanding the shared interests of operators and the public, it’s possible to see the inadequacy of the conventional question, “What could go wrong here and how likely is it?” It would be far better, from this perspective, to look at similar systems and similar conditions to find out what is working or working better. There’s a vast difference between saying that the risks to be managed follow an abstract standard of “high reliability” and, in contrast, that the standard being managed emerges out of the risks found acceptable by the public. Properly elevated and further professionalized by the academy, the high-reliability management of the nation’s infrastructures would help the public better understand our increasingly complex socio-technological landscape of multiple risks and subsequent demands on the public treasury. All of this will help managers administer systems as if people’s lives and livelihoods depend on it—because they do.