Cyberinfrastructure and the Future of Collaborative Work
Online sharing of data, computing power, and expensive equipment is transforming research and blazing the trail for widespread advances in cooperative efforts in all human endeavors.
One of the most stunning aspects of the information technology (IT) revolution has been the speed at which specialized, high-performance tools and capabilities originally developed for specific research communities evolve into products, services, and infrastructure used more broadly by scientists and engineers, and even by the mass public. The Internet itself is the best example of this phenomenon. We have come to expect that the performance of one era’s “high-end” tool or application will be matched or exceeded by the next era’s desktop software or machine. Although the funding of IT-intensive research communities is justified by the results they produce in their own domains, the added value that they provide by proving concepts, developing features, and “breaking in” the technology to the point where it can be adapted to serve larger markets can in some cases have even greater social and economic benefit.
Cyberinfrastructure-enabled research is one area in which today’s cutting-edge researchers are building the foundation for a quantum leap in IT capability. Taking advantage of very-high-bandwidth Internet connections, researchers are able to connect remotely to supercomputers, electron microscopes, particle accelerators, and other expensive equipment so that they can acquire data and work with distant colleagues without traveling. There are several drivers of this trend.
First, in many fields the basic tools to do cutting-edge research are now so prohibitively expensive that they cannot be bought by every lab and campus, or even by every country. In the case of high-energy physics, CERN’s Large Hadron Collider (LHC) in Europe promises capabilities unmatched by any other shared facility on the planet. The Superconducting Super Collider, which the United States began building in the early 1990s, would have been a rival, but Congress canceled its funding in 1993. As a result, U.S. high-energy physicists who want to participate in the search for elemental particles or forces must conduct their experiments in Europe. Even in the life sciences, where the equipment is usually not as expensive, devices such as ultra-high-voltage electron microscopes are beyond the budgets of most universities, and the government funding agencies cannot afford to underwrite the cost for everyone. For example, there has not been a new high-voltage electron microscope fielded for use in biological or biomedical research in the United States for more than 30 years. For this work U.S. researchers depend on the generosity of the Japanese and Korean governments.
A second driver is that a growing number of science and engineering fields are becoming data and computation intensive. This has certainly happened in the life sciences, with the mapping of the human genome and the emergence of new data-intensive fields such as genomics, and is also occurring in the earth sciences and in other fields. Even though computing power continues to decline rapidly in cost, researchers are demanding access to more computing capacity and capability with supercomputers, linked “clusters” of commodity computers, and advanced high-bandwidth networks connecting these resources. This new model of a shared distributed system of advanced instruments and IT components involves deployment of elements that are much too expensive for most universities.
The third driver is the relentless progress of IT itself in making it easier and more affordable to share research data, tools, and computing power. The three elements of emerging distributed IT systems (data storage, networking, and computational capacity/capability) are all advancing at exponential rates, with the price of a given unit of each element continuing to drop rapidly. Networking advances more rapidly than storage, which in turn is on a steeper curve than computing power, but the general implications are clear. The falling prices make it more affordable to link scientific communities with advanced instruments and high-performance computing and to connect distributed databases and other resources in end-to-end environments that can be accessed in real time through a simple and user-friendly interface.
The challenge is to figure out how to use this advancing capability most productively and to create the institutional and policy framework that facilitates research collaboration through this cyberinfrastructure. The solution will differ somewhat from field to field, but it is helpful to look closely at one area that is leading in realizing the potential of the technology.
The Biomedical Informatics Research Network (BIRN) is a National Institutes of Health (NIH) initiative that fosters distributed collaborations in biomedical science by using IT innovations. Currently BIRN involves a consortium of 23 universities and 31 research groups that participate in infrastructure development or in one or more of three test-bed projects centered on structural and/or functional brain imaging of human neurological disorders including Alzheimer’s disease, depression, schizophrenia, multiple sclerosis, attention deficit disorder, brain cancer, and Parkinson’s disease. The BIRN Coordinating Center, which is responsible for developing, implementing, and supporting the IT infrastructure necessary to achieve distributed collaborations and data sharing among the BIRN participants, is located at the University of California at San Diego (UCSD).
BIRN is using these initial test-bed studies to drive the construction and daily use of a data-sharing environment that presents biological data held at geographically separate sites as a single, unified database. To this end, the BIRN program is rapidly producing tools and technologies to enable the aggregation of data from virtually any laboratory’s research program to the BIRN data system. Lessons learned and best practices are continuously collected and made available to help new collaborative efforts make efficient use of this infrastructure at an increasingly rapid pace.
Another activity at UCSD that complements and supports BIRN and other cyberinfrastructure efforts in the life sciences is the Telescience Project, which emerged from the early efforts of researchers at the National Center for Microscopy and Imaging Research (NCMIR) to remotely control bio-imaging instruments. In 1992, NCMIR researchers demonstrated the first system to control an electron microscope over the Internet. Researchers at a conference in Chicago were able to interactively acquire and view images via remote control of one of the intermediate-voltage electron microscopes at NCMIR and to simultaneously refine these data using a remotely located Cray supercomputer.
In the mid-1990s, Web-based telemicroscopy was made available to NCMIR’s user community in the United States and abroad, which was then able to effectively use the remote interface to acquire data. It became clear, however, that to make the most of this capability it would also be necessary to link this data acquisition with enhanced data computation and storage resources. The Telescience Project was developed to address this issue. It provides a grid-based architecture that combines telemicroscopy with tools for parallel distributed computation, distributed data management and archiving, and interactive integrated visualization, yielding an end-to-end solution for high-throughput microscopy. This integrated system is increasing the throughput of data acquisition and processing and ultimately improving the accuracy of the final data products. The Telescience Project merges technologies for remote control, grid computing, and federated digital libraries of multi-scale data important to understanding cell structure and function in health and disease.
The Telescience Project serves as a “skunk works” for BIRN, allowing new concepts and technologies to be developed and tested before insertion into the BIRN production environment. In turn, building an end-to-end system is a great way to find out what works and what does not. If researchers are unhappy with the performance, they will not use it. On the other hand, if a system is easy to use, it opens a wealth of possibilities, as we have found in building and fielding the BIRN. Science is a dynamic, social process. The cyberinfrastructure-enabled environment facilitates new forms of collaborative research, whose dynamism further drives the research, which in turn further drives the development of new IT tools and capabilities.
For example, collaborating scientists in Kentucky and Buenos Aires could schedule an experiment on the $50 million Korean high-energy electron microscope. The scientists could jointly drive the operation of the microscope. To ensure that they collect the most useful data, they can generate preliminary three-dimensional (3D) results that are immediately streamed for analysis to computers that are dynamically selected from a pool of globally distributed computational resources. The resultant output can then be visualized in 3D and used by the scientists to determine how to guide the work session at the microscope. Throughout this process, raw data, data intermediates, and meta-data can be automatically added to integrated databases. In order to complete this type of data-driven remote session in a reasonable time period, the networking paths among resources must be intelligently coordinated along with the data input and output of the variety of software applications being used. As material is added to the integrated databases, the same researchers or an entirely different team of scientists can be mining the data for other uses. Through the innovations of the Telescience Project and BIRN, these tightly integrated sessions are not only possible, they are increasingly routine, secure, and accessible via a web portal with a single sign-on. More important, through the convergence of usability, computational horsepower, and richly integrated workflows, the end-to-end throughput for generating scientific results is increasing.
Using the BIRN infrastructure, scientists are developing new insights and resources that will improve clinicians’ ability to identify and diagnose problems such as Alzheimer’s disease. BIRN researchers at the Center for Imaging Science at Johns Hopkins University, collaborating with other BIRN researchers, developed a pipeline to enable seamless processing of shape information from high-resolution structural magnetic resonance scans. In the initial study, hippocampal data from 45 subjects (21 control subjects, 18 Alzheimer’s subjects, 6 subjects exhibiting a rare form of dementia, called semantic dementia) were analyzed by comparing these 45 human hippocampi bilaterally to one another (4,050 comparisons for both left and right hippocampi). This large-scale computation required over 30,000 processor hours on the NSF-supported TeraGrid and produced over four terabytes of data that were stored for subsequent analysis on the NIH-supported BIRN data grid. Using the results of the shape analyses, BIRN researchers were able to successfully classify the different subject groups through the use of noninvasive imaging methodologies, potentially providing clinicians with new tools to assist them in their daily work.
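The comparison count quoted above can be reproduced with a short calculation. The sketch below is only an illustration: it assumes that, on each side of the brain, every hippocampus is compared with every hippocampus in the set, counting ordered pairs and self-comparisons (45 × 45 per side). That reading is inferred from the 4,050 figure and is not stated by the study; the function and variable names are hypothetical.

```python
# Group sizes as quoted in the text.
group_sizes = {"control": 21, "alzheimers": 18, "semantic_dementia": 6}


def comparison_count(n_subjects: int, sides: int = 2) -> int:
    """Count all-to-all comparisons under the stated assumption.

    Assumption (not from the study): each side uses n * n ordered
    comparisons, including each hippocampus against itself.
    """
    return sides * n_subjects * n_subjects


n_subjects = sum(group_sizes.values())
total_comparisons = comparison_count(n_subjects)

# "Over 30,000 processor hours" gives only a lower bound on the average cost.
min_hours_per_comparison = 30_000 / total_comparisons

print(f"subjects: {n_subjects}")
print(f"comparisons (left + right): {total_comparisons}")
print(f"average processor hours per comparison: at least {min_hours_per_comparison:.1f}")
```

Under this assumption the subject groups sum to 45 and the two sides together give 4,050 comparisons, which matches the figure in the text and implies an average of several processor hours per comparison.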
Work is now under way to extend the Telescience Project and BIRN to create an integrated environment for building digital “visible cells,” a multiscale set of interconnected 3D images that accurately represent the subcellular, cellular, and tissue structures for various cellular subsystems in the body. With the increasing availability of biological structural data that spans multiple scales and imaging modalities, the goal is to create realistic digital models of cellular subsystems that can be used to create simulations that will rapidly advance our understanding of the impact of structure on function in the living organism. Deciphering the structure-function relationship is one of the grand challenges in structural biology. The Telescience Project and BIRN are knitting together an IT fabric and coordinating interdisciplinary teamwork to build a shared infrastructure that integrates the tools, resources, and expertise necessary to accelerate progress in this fundamental domain of biology.
The benefits of this effort are not limited to cutting-edge research. Telemicroscopy has been successfully used in the classroom to expose students to interdisciplinary research, and the visible cell project will produce materials that are useful in K-12 education.
Cyberinfrastructure-supported environments such as BIRN are changing science. They can significantly improve the cost effectiveness of research so that society will get more basic research bang for its buck. This can help relieve some of the pressure on government funding agencies, such as NIH, which must emphasize translational-research accomplishments most often based on advances resulting from investments in basic science. Enhanced research productivity resulting from shared infrastructure can reduce costs attributable to geographic duplication of facilities, thus helping to balance the investments supporting applied research and those supporting basic, curiosity-driven discovery, both of which are critical to addressing major societal needs.
A second possible implication is that cyberinfrastructure-supported environments will enable interdisciplinary teams to be more effective in attacking the big long-term or “stretch” goals in research. This could shift the sociology of the research enterprise away from a focus on individual goals, which are generally short-term, and provide incentives for a greater emphasis on effective collaboration in interdisciplinary, interinstitutional settings. This will call for universities to change the way in which they evaluate and reward faculty members and will hopefully create new motivation for scientists to work together in ways that will speed progress.
It will also allow researchers at smaller research institutions in the United States and around the world to participate in cutting-edge research in fields where cyberinfrastructure is sufficiently advanced. It will create the research equivalent of the global business networks that Thomas Friedman describes in his recent book The World Is Flat.
Although all disciplines can learn from BIRN’s experience, the sociological transition from physical to virtual research community is a process that each field and discipline needs to go through on its own. At BIRN, we may develop portals and other specific tools with possible application to cyberinfrastructure efforts in other fields, but the hard work of designing, building, using, and continuously improving end-to-end production environments does not seem to be amenable to easy transfer from field to field because it must be adapted for the equipment used, the nature of the data collected, and the types of collaboration that are appropriate. But that also can change.
NSF Director Arden Bement has talked about his vision of the day when cyberinfrastructure joins “the ranks of the electrical grid, the interstate highway system and other traditional infrastructures. A sufficiently advanced cyberinfrastructure will simply work, and users won’t care how.” Bement’s remark points to what we believe will be the most important long-term effect of cyberinfrastructure. Just as the Internet has become a tool that is used by everyone, the research cyberinfrastructure we are creating today will evolve into the basic platform for all manner of collaborative knowledge work in the future, affecting business and commerce, entertainment, education, health care, and just about every other human activity.
If the cyberinfrastructure is to deliver on its enormous promise, progress is essential in two areas. First, more intensive work is needed on frameworks for data integration. As data storage becomes less expensive, we are faced with a rapidly growing mountain of data. In order to be useful, the data need to be accessible and integrated into other data sets. We need more effective ways to bring data together on the fly in ways that can be visualized and understood by a researcher. Google is a useful analogy. What we need is really a “Google on steroids.” The beginning of what is needed can be seen in the efforts of researchers to integrate the enormous amount of data accumulating about the diversity of human genotypes.
Although fundamentally a software problem, data integration is also an organizational challenge because there are different approaches being developed within the scientific community. One can choose a more prescriptive approach by requiring the definition of specific meta-data entities that must be used by all sources, as is done by the Cancer Bioinformatics Grid operated by the National Cancer Institute, or one can develop more flexible methods to bring together diverse data sources that may not be based on similar standards, as is being done in BIRN. Although the two approaches seem mutually exclusive, they are actually complementary and provide a fertile area for collaboration between these projects.
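The contrast between the two approaches can be made concrete with a small sketch. Everything in it is hypothetical: the field names, the site names, and the functions are invented for illustration and do not reflect the actual schemas or software of the Cancer Bioinformatics Grid or BIRN. The first function models the prescriptive style, where every source must supply an agreed set of fields. The second models the flexible style, where each source keeps its own layout and a per-source mapping translates it into a shared view.

```python
from typing import Callable, Dict

# Hypothetical shared vocabulary, invented for this example.
REQUIRED_FIELDS = {"subject_id", "diagnosis", "hippocampal_volume_mm3"}


def accept_prescriptive(record: dict) -> dict:
    """Prescriptive style: reject any record lacking the agreed fields."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"record is missing required fields: {sorted(missing)}")
    return {name: record[name] for name in REQUIRED_FIELDS}


# Flexible style: each (hypothetical) site registers a mapping from its own
# layout, including its own units, into the shared view.
Mapper = Callable[[dict], dict]
SITE_MAPPERS: Dict[str, Mapper] = {
    "site_a": lambda r: {
        "subject_id": r["id"],
        "diagnosis": r["dx"],
        "hippocampal_volume_mm3": r["hc_vol_mm3"],
    },
    "site_b": lambda r: {
        "subject_id": r["subject"],
        "diagnosis": r["group"],
        # This site stores cubic centimeters; 1 cm^3 = 1000 mm^3.
        "hippocampal_volume_mm3": r["hc_vol_cm3"] * 1000.0,
    },
}


def integrate_flexible(site: str, record: dict) -> dict:
    """Translate one site's native record into the shared view."""
    return SITE_MAPPERS[site](record)


if __name__ == "__main__":
    native = {"subject": "s2", "group": "control", "hc_vol_cm3": 3.0}
    unified = integrate_flexible("site_b", native)
    # The mapped record now also satisfies the prescriptive check.
    print(accept_prescriptive(unified))
```

The last two lines hint at why the approaches can be complementary: a record translated by a flexible mapping can then pass a prescriptive check, so the two styles can be layered rather than chosen between.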
A second challenge involves networking architectures. Technological advances are creating the potential for networks to be managed much more efficiently. For example, OptIPuter, a project of the Cal-IT2 Institute of UCSD and the Electronic Visualization Laboratory at the University of Illinois at Chicago, is developing an innovative approach to using fiber-optic data connections. Many messages travel on a fiber-optic cable simultaneously, and the standard protocol is to convert large blocks of data into small packets that travel together with other messages from other users. In the OptIPuter approach, traffic is managed so that one project can own a specific wavelength (color) on an optical fiber path (like a dedicated line) for an instant. This impulse-based management is better adapted to the nature of the data flows in scientific research and will also be better adapted to emerging network use patterns outside of research. But some of the entrenched, and occasionally struggling, commercial interests that control the telecommunications infrastructure have a large financial stake in staying with management systems that favor a constant, high level of traffic over the sequential, impulse-based approach. Taking the necessary regulatory steps to remove these sorts of barriers and provide incentives for adoption of revolutionary new technologies is a significant policy challenge.
In addition, the data integration and networking challenges feed on each other, since greater access to storage leads to greater demand for networks, and vice versa. This points to another area for necessary policy focus in the coming years: the need to build a broad base of support for investments in these areas.
Certain pieces and functions of cyberinfrastructure have developed dedicated policy communities and constituencies that focus attention on their current and future needs. High-performance computing, with its support base and practitioners in the agencies, national labs, academia, and industry, is a good example. Groups such as the Council on Competitiveness effectively stimulate discussion of issues and concerns related to computing needs. Other organized constituencies exist for research networking, where coalitions such as National LambdaRail and Internet2 have emerged to provide effective focus and leadership, and for digital archiving, where libraries and librarians effectively promote their needs.
Yet outside the affected domains themselves, and their traditional sources of support, there has been no broad constituency or leadership institution concerned with developing the integrative tools and systems needed to build end-to-end systems such as BIRN. It would be particularly useful if there were more effective support for interdisciplinary programs that engage appropriate parts of the computer science community in cooperative work with domain scientists addressing these problems. This is how the key tools of middleware (software that enables novice and expert alike to access resources and participate) are being developed now, but we are really only at the beginning of this process. New software architectures are needed to integrate huge distributed systems, but the computer science community and other stakeholders will not come to the table without the right incentives. Perhaps the long-term solution is to train a core of scientists and engineers within each field who have special expertise in IT as well as their primary specialty in a manner similar to how molecular biology has become embedded in nearly every bioscience subdiscipline over the past 20 years.
Several years ago Daniel Atkins at the University of Michigan chaired a blue-ribbon panel for NSF’s Computer and Information Science and Engineering Directorate that recommended a significant new interagency effort in cyberinfrastructure. Now it appears that NSF is starting to take on the necessary leadership role called for in the Atkins report. That is an encouraging step, but what is really needed now is a broader interagency initiative that takes advantage of the research and institutions supported by NIH, the National Oceanic and Atmospheric Administration, the Department of Energy, and others to accelerate the process of building new virtual research communities. Relatively modest investments by the Department of Defense, NSF, and other agencies led to today’s commodity Internet and the enormous new industries and markets that the Internet has made possible. With the right approach to cyberinfrastructure, the United States can again leverage investments in new research resources to build the foundation for the next IT revolution.
Mark Ellisman is professor of neurosciences and bioengineering and director of the Center for Research in Biological Systems at the University of California at San Diego.