The National Academies of Sciences, Engineering and Medicine
Board on Research Data and Information
Policy and Global Affairs

January 23, 2009 DRAFT
The Future of Scientific Knowledge Discovery in
Open Networked Environments[1]
A National Symposium and Workshop
U.S. National Committee for CODATA
Board on International Scientific Organizations
in collaboration with
Computer Science and Telecommunications Board
The National Academies
Project Summary
Digital technologies and networks have enhanced access to and use of scientific data, information, and literature significantly, and also have great promise for accelerating the discovery and the communication of knowledge both within the scientific community and in the broader society. This is particularly the case for scientific data and information that are openly available online. Scientific knowledge discovery in open networked environments, referred to in this proposal as computer-mediated or computational scientific knowledge discovery (SKD), may be defined as a research process that is enabled by different digital computing technologies such as data mining, information retrieval and extraction, artificial intelligence, distributed grid computing, and many other automated methods. Together, these technological capabilities are supporting the emergence of computer-mediated SKD as a new paradigm in the conduct of research.
A symposium and workshop will be convened at the National Academies to bring together key stakeholders in this area for intensive and structured discussions in order to obtain a better understanding of the state-of-the-art and future trends in the study of computational SKD in the open online environment and to develop a range of options for future work in this area. Specifically, the project will be performed pursuant to the following statement of task:
1. Opportunities and Benefits of SKD: What are the opportunities over the next 5-10 years associated with the use of computer-mediated scientific knowledge discovery (SKD) across disciplines in the open online environment? What are the potential benefits to science and society of SKD?  
2. Techniques and Methods for Development and Study of SKD: What are the techniques and methods used in government, academia, and industry to study and understand these processes, the validity and reliability of their results, and their impact inside and outside science?
3. Barriers to SKD: What are the major scientific, technological, institutional, sociological, and policy barriers to computer-mediated SKD in the open online environment within the scientific community? What needs to be known and studied about each of these barriers to help achieve the opportunities for interdisciplinary science and complex problem solving?
4. Range of Options:  Based on the results obtained in response to items 1-3 above, define a range of options that can be used by the sponsors of the project, as well as other similar organizations, to obtain and promote a better understanding of the computer-mediated scientific knowledge discovery processes and mechanisms for openly available data and information online across the scientific domains. The objective of defining these options is to improve the activities of the sponsors (and other similar organizations) and the activities of researchers that they fund externally in this emerging research area. 
A one-and-a-half day symposium with invited expert speakers will be held to address tasks 1-3 above. This will be followed immediately by a one-day workshop to address task 4, drawing on the symposium discussions of tasks 1-3 and on the workshop participants' own expertise. A steering committee for the project will help to organize the symposium and workshop, and a rapporteur will prepare a summary report from the workshop. Both the workshop summary report and the symposium proceedings will be published at the conclusion of the project. The symposium program will be Webcast as well. The workshop report will synthesize the contributions of the experts and is expected to result in an authoritative and high-level review of the most promising and effective research opportunities in this area.
Intellectual Merit of the Proposed Activity
The ways in which scientific knowledge is discovered and communicated openly online have changed dramatically over the last three decades, and especially in the last ten years. Multiple factors have contributed to massive growth in this area. The production of vast amounts of data and literature, coupled with the rapid development and advancement of processing, storage, and communication technologies, has made it both possible and necessary to use more computing technologies and methods in the research process. Computer-mediated discovery processes are being increasingly adopted and used by different scientific communities to explore and study a wide range of domains of knowledge, including all scientific disciplines. Computational SKD is thus rapidly becoming a new form of scientific inquiry in the virtual environment, building upon and supplementing the research based on theoretical, experimental, and observational methods that preceded it.
At the same time, many new models of open science have been developed that take much greater advantage of the capabilities of digital networks. When integrated together online, various types of open knowledge resources are forming incipient information “commons” and knowledge environments, which derive more value from the public investments in research. Of particular interest to this proposed project, such mechanisms can enable more efficient and effective applications of digital SKD tools and techniques.
Although techniques and tools to study and improve the technical capabilities of these processes exist and are well established in different contexts, methods for studying, measuring, and understanding other attributes of these processes, such as validity, reliability, and impact, are not well developed within the scientific community, especially in interdisciplinary research areas. This gap is important to address because the more specialized and discipline-focused an area of research is, the more difficult is the flow and resulting utilization of knowledge across disciplines and sectors. A focused examination of the opportunities and barriers to the study of computer-mediated SKD in the open online environment by leading researchers in this area of inquiry, with a view to identifying a range of research options for improving the understanding of these processes and practices in interdisciplinary research, can lead to significant enhancements, in the near term and beyond, to the efficiency and effectiveness of the discovery and related communication of scientific knowledge.
Broader Impacts of the Proposed Activity
A better understanding of the study of new and emerging computational SKD processes for open science online can yield broad benefits not only to our nation’s research base, but to our economy and society. Within scientific research, such improved understanding can be used to make better decisions about information technology management and investments, organizational models, and research management, collaboration and policy, particularly for interdisciplinary studies. A deeper understanding of the opportunities and barriers to such processes holds the potential to accelerate greatly the progress of scientific research, to support U.S. national competitiveness and increased productivity in information-intensive areas of research and its applications, and to enable research managers and policy makers to make much more informed decisions about the research enterprise. Finally, improvements in the study of the automated discovery of scientific knowledge from openly available data and information on digital networks will help make it easier for the scientific community to explain more clearly to policy makers and to the taxpayers how the public investment in research and digital technologies advances broader socioeconomic interests.
Computer-mediated scientific knowledge discovery (SKD) applies a broad range of computing technologies and methods. For example, techniques such as data mining, information retrieval and extraction, artificial intelligence, distributed grid computing, and many others each may be applied in appropriate contexts. The development of the networked cyberinfrastructure, especially advanced research networks, has revolutionized both quantitatively and qualitatively the ways in which openly available scientific data and information are searched, stored, manipulated, analyzed, and communicated in a cycle of discovery and accelerated scientific progress. To a large extent, the Internet and other related computational technologies have made access to the vast and increasing amounts of scientific data and information much easier and more cost-effective.
Quantitative and qualitative advantages of digital technologies and networks for research. Digital technologies and global networks have made possible — if not yet fully realized —  dramatic increases in the quantity and quality of knowledge discovery and communication in all realms of human activity, not least in science.
The well-known quantitative advantages that digital networks have over the previous print paradigm are in time, extent, and cost. Digital networks provide instantaneous, concurrent, and global availability at near-zero marginal cost of access by each additional user. These quantitative improvements make possible, at least in theory, the universal availability of information for both human and automated knowledge discovery, a capability that is only now beginning to be realized. At the same time, however, the ever-increasing amounts of digital data and information also pose huge challenges to users who depend on these resources for research and development.
Just as important, however, are the qualitative advantages afforded by digital technologies and networks in accelerating the discovery of knowledge. Because networks provide the opportunity for both interactive and asynchronous communication of text, data, images, and other media, the potential to access, develop, and transfer knowledge has also increased.
The digital nature of the information imbues it with much greater flexibility of access and use, making it easier to index, search, manipulate, and integrate across sources and types of information to create new knowledge that was either not possible or much more difficult in the print context. Moreover, the network enables entirely new forms of collaborative knowledge production on a broadly distributed and/or interactive basis, with the potential for changing the more hierarchical and centralized organizational models through which knowledge was produced and communicated in previous eras. And, most important in the context of this proposal, digital networks make possible entirely automated approaches to information extraction, processing, integration, and organization of vast amounts of information, which can be transformed into unlimited new discoveries and products, eclipsing the capabilities of human knowledge production and its communication.
The rise of computational SKD as a new research paradigm in the open online environment.
The quantitative and qualitative features of digital networks constitute a new and still emerging paradigm in the conduct of research worldwide. These quantitative and qualitative features of digital networks can be exploited most completely, however, when the information is made freely and openly available online, with minimum reuse restrictions. Box 1 identifies a range of distributed, open, collaborative research and information production and dissemination activities using digital networks.

Box 1:
There are many new kinds of distributed, open collaborative research and information production and dissemination on digital networks. Examples of open data and information production activities include:
  • Open-source software movement (e.g., Linux and tens of thousands of other programs worldwide, many of which originated in academia and are developed for research purposes);
  • Distributed Grid computing or e-science (e.g., LHC@home);
  • Community-based open peer review (e.g., Journal of Atmospheric Chemistry and Physics); and
  • Collaborative research Web sites and portals (e.g., NASA Clickworkers, Wikipedia, Curriki).
The following are examples of open data and information dissemination and permanent retention:
  • Open data centers and archives (e.g., GenBank, the Protein Data Bank, The SNP Consortium, Digital Sky Survey);
  • Federated open data networks (e.g., World Data Centers, Global Biodiversity Information Facility; NASA Distributed Active Archive Centers);
  • Virtual observatories (e.g., the International Virtual Observatory for astronomy, Digital Earth);
  • Open access journals (e.g., BioMed Central, Public Library of Science, plus more than 2,500 scholarly journals);
  • Open institutional repositories for that institution’s scholarly works (e.g., the Indian Institute for Science, plus hundreds globally);
  • Open institutional repositories for publications in a specific subject area (e.g., PubMedCentral, the physics arXiv);
  • Free university curricula online (e.g., the MIT OpenCourseWare); and
  • Emerging digital commons and knowledge environments (e.g., the Neurocommons, Conservation Commons).

A networked virtual telescope allows all types of users (not just astronomers) to conduct “sky searches” based on queries to multiple, distributed astronomical databases (Szalay and Gray, 2006). This new research method amplifies the capabilities of every physical telescope by enabling correlations of data that previously were either impossible or too expensive because the data were not in the same place, in the same format, or were not easily accessible. The costs for observational astronomy are now so radically changed that fundamentally different kinds of questions can now be asked and the integrated data can be brought more quickly, inexpensively, and effectively to all the world’s astronomers and other users.
In experimental materials science, distributed and collaborative computer modeling is rapidly advancing basic knowledge of molecular electronics, bringing together researchers in universities and industry with educators and students. Breakthroughs in such molecular electronics research can revolutionize the electronics industry by making the miniaturization of devices possible with atomic-scale active components (NSF, 2006).
In biomedical research, there are numerous initiatives that build upon the breakthroughs starting about two decades ago in bioinformatics and computational biology. For example, in recent years the National Cancer Institute has developed the cancer Biomedical Informatics Grid (caBIG). This federated and standardized infrastructure incorporates various online applications, software tools, and openly available data and information in an integrated platform to advance research in oncology (McKinney, 2007). And the Science Commons has initiated a new project, the Neurocommons, which will combine technical, semantic, and legal interoperability online to enable automated knowledge discovery by both university and industry researchers across the data and literature of various neurological diseases.
The growing list of distributed, open collaborative research and information production activities shows that the technological capabilities supporting the new computational SKD research tools have begun to be institutionalized in different disciplines and applications, raising new management and funding issues for the research community. Although many conferences, workshops, and reports in recent years have focused on the use of such tools within specific discipline applications, they typically have not addressed how these tools can be used effectively across disciplines and sectors to enhance scientific integration and improve complex problem-solving capabilities.
Not surprisingly, the rapid growth of openly available digital data and information collections and the generally untapped potential to exploit them for scientific discoveries and other applications have led to increased interest in applying more advanced computer-mediated discovery processes in practically all fields of research. As the 2003 NSF report, Revolutionizing Science and Engineering Through Cyberinfrastructure (Atkins, 2003) pointed out, “vast improvements in raw computing power, storage capacity, algorithms, and networking capabilities have led to fundamental scientific discoveries inspired by a new generation of computational models that approach scientific and engineering problems from a broader and deeper systems perspective. Scientists in many disciplines have begun revolutionizing their fields by using computers, digital data, and networks to extend and even replace traditional techniques”.
Fully realizing the vision of faster and more cost-effective computer-mediated SKD will depend on developing, deploying, refining, and further understanding the tools, methods, and techniques that offer better scientific and technical capabilities, especially across disciplines for complex problem solving. Traditional approaches, such as statistical analyses and visualization techniques, have proven useful for tasks such as detecting statistical trends, finding correlations between attributes, and comprehending complex or voluminous data, to name a few. They are, however, limited in the types of knowledge and regularities they can derive from data. For example, running a statistical analysis can help in detecting a correlation between different factors, but it cannot generate a conceptual explanation of why such a correlation exists, nor can it formulate any specific quantitative or qualitative principle(s) responsible for this correlation.
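The limitation described above can be illustrated with a minimal sketch. The function below computes a Pearson correlation coefficient from first principles; the two attribute lists (temperature and growth_rate) are hypothetical values invented for illustration and do not come from any study cited in this proposal. The output is a single number that measures linear association; nothing in it says why the two attributes move together.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements of two attributes (illustrative values only).
temperature = [10.0, 12.0, 14.0, 16.0, 18.0]
growth_rate = [1.1, 1.9, 3.2, 3.8, 5.0]

r = pearson_r(temperature, growth_rate)
print(f"correlation coefficient: {r:.3f}")
```

A coefficient near 1 would flag a strong association worth investigating, but formulating the mechanism behind it remains a separate step that this kind of analysis does not perform.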
The emerging knowledge discovery process used for the automated mining of large, openly available digital databases, for example, is a case in point. It typically follows several stages: data warehousing, target data selection, cleaning, preprocessing, transformation and reduction, data mining, model selection (or combination), evaluation and interpretation, and finally consolidation and use of the extracted knowledge. These stages frequently involve the work of different communities that can be the suppliers and consumers of these discovery processes and outputs. The database, statistical, machine learning, and visualization communities have been among the most important in the development of this type of computer-mediated SKD. More effective communication and collaboration among these communities and others can lead to better exploitation of the very large and diverse amounts of scientific data and information that are openly available online and ready to be used, and of the new tools, methods, and techniques that are emerging as a result of the continuous development and advancement of the cyberinfrastructure. Effective communication also can help in better understanding these processes, especially in the multi- and inter-disciplinary settings.
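The staged process described above can be sketched, under strong simplifying assumptions, as a chain of small functions. Everything here is hypothetical: the RAW records, the field names, the stage functions, and the thresholds are placeholders chosen for illustration, and the "mining" step is only a per-site average standing in for a real data mining method.

```python
from statistics import mean

# Hypothetical raw records standing in for an openly available database.
RAW = [
    {"site": "A", "year": 2001, "value": 2.0},
    {"site": "A", "year": 2002, "value": None},  # missing measurement
    {"site": "A", "year": 2003, "value": 4.0},
    {"site": "B", "year": 2001, "value": 10.0},
    {"site": "B", "year": 2003, "value": 14.0},
    {"site": "C", "year": 1990, "value": 7.0},   # outside the target period
]


def select(records, first_year):
    """Target data selection: keep records from the period of interest."""
    return [r for r in records if r["year"] >= first_year]


def clean(records):
    """Cleaning: drop records with missing measurements."""
    return [r for r in records if r["value"] is not None]


def transform(records):
    """Transformation and reduction: group measured values by site."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["site"], []).append(r["value"])
    return grouped


def mine(grouped):
    """Stand-in for data mining: summarize each site by its mean value."""
    return {site: mean(values) for site, values in grouped.items()}


def evaluate(patterns, threshold):
    """Evaluation: keep only the patterns that pass a chosen threshold."""
    return {site: m for site, m in patterns.items() if m >= threshold}


def run_pipeline(records, first_year=2000, threshold=5.0):
    """Run the stages in order and return the retained results."""
    return evaluate(mine(transform(clean(select(records, first_year)))), threshold)


knowledge = run_pipeline(RAW)
print(knowledge)
```

Keeping each stage as a separate function mirrors the point made in the text: different communities may own different stages, and each stage can be replaced or studied on its own.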
The need to exploit computational SKD more effectively for interdisciplinary research and complex problem solving
Furthermore, collaborative and technology-enabled open research environments provide new opportunities for researchers from geographically distributed locations to work with colleagues from other institutions. They also provide opportunities for junior scientists and students, even those not from what are traditionally considered research universities. For example, these new environments, according to the NSF Cyberinfrastructure report “can contribute to science and engineering education by providing interesting resources, exciting experiences, and expert mentoring to students, faculty, and teachers anywhere there is access to the Internet. The new tools, resources, human capacity building, and organizational structures emerging from these activities will also eventually have even broader beneficial impact on the future of education at all levels and likely on all types of educational institutions.” Maximizing the benefits and values obtained from investments in digital technologies, however, requires a multi-disciplinary blend of expertise in domain science or engineering, mathematical and computational modeling, numerical methods, visualization, and the socio-cultural understanding about working in new grid or collaboratory organizations.
Improving the study and understanding of the current computer-mediated SKD tools, methods, and techniques provides an important opportunity for promoting progress in these activities. There is still insufficient research in the area of computer-mediated SKD on evaluating and comparing the effectiveness of these methods in the open online environment. Furthermore, questions about which methods enhance inter- or intra-disciplinary collaboration, and how they do so, need more investigation and empirical answers.
Finally, the communication of the outputs of the discoveries in a manner that promotes their optimal utilization in the digital environment for computational SKD research requires attention as well. Scientific knowledge discovery and communication of course are interdependent and increasingly so in the digital era. As noted at the outset, there is a cycle of processes in which knowledge is created and communicated, and the received knowledge is reused for further incremental and new discoveries within and outside science. Digital networks can make the flow of this discovery-communication-discovery cycle much easier, more productive, and more cost-effective, especially when such processes can be fully automated, but only if the many barriers to knowledge discovery and communication are well studied, understood, and mitigated.
Barriers to Computer-mediated Knowledge Discovery in Complex, Interdisciplinary Research
Although many of the opportunities and potential benefits of openly available data and information on digital networks to the acceleration of SKD are intuitive and widely recognized, there are numerous barriers to their full realization, especially across discipline boundaries for complex problem solving, that result from a range of scientific, technological, institutional, policy, and socio-cultural factors. Some of these barriers are consistent with factors already present in the pre-digital era, while others are new or magnified by the advent of digital technologies and networks. Furthermore, the barriers change either in their nature or intensity, depending on the complexity and narrowness of the types of discovery being pursued and on whether the research that is being promoted is at the intra-, inter- or extra-community level. Because of the high level of interconnection and interdependence between the discovery and communication processes, particularly in digital SKD, the barriers to this cycle of knowledge discovery and communication processes need to be considered together, rather than separately, in an integrated, strategic approach. Identifying, studying and analyzing these barriers is critical for properly addressing them and for developing better and more cost-effective computer-mediated SKD processes. Below, some of the barriers are briefly introduced, with a number of key problems highlighted and some preliminary questions suggested concerning the development of research strategies and frameworks.
Scientific factors. Although much of scientific research is intensively specialized and (properly) narrow in its scope for advances in specific sub-discipline areas, at the same time research is increasingly complex, multi-scale, multi-disciplinary, and multi-institutional. The torrent of bits, particularly from exponential increases in data collections, but also from a rapidly growing body of research literature, makes the application of different computer-mediated SKD research not only extremely useful, but essential for accelerating the progress of science and resulting applications, particularly in the interdisciplinary arena.
  • Different computational SKD strategies for experimental versus observational sciences. There are some important differences among scientific disciplines that can affect the adoption and implementation of the computational SKD research paradigm. For example, digital data in research based on observational and experimental methods have very different characteristics and uses, and at different stages of data processing, which are essential to understand in order to develop the appropriate strategies for the application of SKD tools and techniques (NRC 1995). On the one hand, the results of many laboratory experiments must be reproducible (with the exception of very expensive, unique experimental facilities) so the long-term preservation of the raw data from the experiment generally is not essential; rather, it is the collections of highly evaluated data that are especially important in such disciplines and used by many researchers. On the other hand, the results of observations, especially longitudinal data sets, are typically unique and not reproducible, so that long-term access to such unenhanced data collections by a broad range of researchers is essential.
  • Key differences between large- and small-scale research. A similar dichotomy regarding the management and use of digital data exists between so-called “big” and “small” sciences (NRC 1997). The former refers to large-scale research programs, typically organized around complex and expensive data-collection facilities that produce huge standardized databases managed as common information resources for the scientific community. The latter is characterized by the research conducted by autonomous researchers, either on an individual basis or in small groups, who produce many heterogeneous, small, and non-standard data sets that typically are not shared or pooled for use by others. For various reasons, some of which are identified in the discussion of other barriers below, such heterogeneous small science data are much less amenable to digital SKD exploitation than the data produced through the big science activities. Clearly, these major types of research and digital research data constitute differences that are important to the potential application of SKD tools and techniques, and that require different research strategies.
  • Overcoming barriers from specialization of research to the interdisciplinary applications of computational SKD tools. These differences are similarly pronounced in the access to and use of the scientific literature and in the process of scientific communication. The more specialized and discipline-focused the area of research is, the more difficult is the flow of the resulting knowledge at the three diffusion levels: inter-, intra-, and extra-community. Scientific knowledge usually is written and explained in a discipline-specific technical language, which makes it very challenging for non-experts to read, understand, and reuse in other contexts and for further discoveries. This disconnectedness among the disciplines and the lack of exposure to others’ work are magnified by the stovepiped organization of research and educational institutions according to disciplines, as well as by discipline-specific conferences and meetings. The failures of effective communication among and between scientific disciplines, of course, negatively affect the processes of knowledge discovery and require special attention at the research institution and policy-making levels. These failures can be ameliorated through the use of various digital SKD tools that can automatically search, extract, analyze, and integrate data and information from diverse sources and disciplines, but only if those digital resources are properly prepared for such automated applications and are made openly available without excessive reuse restrictions. For example, automated approaches using new text mining procedures, also referred to as literature-based or literature-assisted discovery, can support unexpected, “radical” discovery and innovation (Kostoff, 2005).
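One simple way to picture literature-based discovery is a co-occurrence search for indirect links: a starting term A appears with bridging terms B, and those B terms appear elsewhere with terms C that never appear alongside A. The sketch below implements that idea only; it is an assumption made for illustration and is not the procedure described by Kostoff (2005). The documents and term names (compound_x, pathway_1, and so on) are hypothetical placeholders.

```python
def candidate_links(documents, start_term):
    """Suggest terms indirectly linked to start_term through shared bridging terms.

    documents: iterable of term collections, one per document.
    Returns a dict mapping each candidate term to the set of bridging terms
    that connect it to start_term.
    """
    docs = [set(d) for d in documents]

    # Terms that co-occur directly with the starting term (the bridging candidates).
    direct = set()
    for d in docs:
        if start_term in d:
            direct |= d - {start_term}

    candidates = {}
    for d in docs:
        if start_term in d:
            continue  # only documents that never mention the starting term
        bridges = d & direct
        if not bridges:
            continue  # no shared bridging term, so no indirect link
        for term in d - direct:
            candidates.setdefault(term, set()).update(bridges)
    return candidates


# Hypothetical term sets extracted from five documents in two literatures.
documents = [
    ["compound_x", "pathway_1"],
    ["compound_x", "pathway_2"],
    ["pathway_1", "disease_y"],
    ["pathway_2", "disease_y"],
    ["compound_z", "symptom_q"],
]

links = candidate_links(documents, "compound_x")
print(links)
```

In this toy case the only suggestion is disease_y, reached through both pathway terms, even though no single document mentions compound_x and disease_y together. Such output is a hypothesis for expert review, not a finding.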
Some questions concerning the characteristics of different disciplines and research processes in relation to computational SKD. Some of the questions that could benefit from more systematic study include the following: How can the scientific community facilitate computational SKD research for complex problem-solving? More specifically, what are the key differences among the disciplines in the open online environment to enable and promote multi- and inter-disciplinary research based on computer-mediated SKD? What are some successful models of computational SKD that can be emulated across disciplines and what are the key principles for success?
Technological factors. Although advances in software and applied mathematics have provided a variety of tools for database development and management, they also have created new challenges to the effective communication, access, and reuse of the knowledge that is produced.
  • Technological and semantic incompatibilities. The technical incompatibilities (Hay and Nance, 2004; Kurabayashi et al., 2002) and semantic differences among the many scientific databases are a serious barrier that hinders the potential for better cooperation and interconnectedness between and even within disciplines. It is common to find that the original storage format and the format required by models and automated analytical tools are not compatible, requiring more effort on the part of researchers to find and apply a suitable conversion. This frequently occurs because databases and database management systems are designed for a particular purpose and without consideration of future computational SKD research, such as the use of data mining and other techniques. Of course, this is not surprising in the absence of adequate resources or incentives for doing otherwise.
  • Heterogeneous data. More specifically, there are significant problems associated with having large amounts of scientific data stored at different locations and in heterogeneous formats and systems (i.e., the small science effects identified above). The diversity and heterogeneity of many data sources, as inputs for or as outputs of the discovery process, make successful and cost-effective information extraction from these sources difficult and negatively affect the ability for further human or especially automated exploration of information and related knowledge discovery. Although existing methods such as advanced statistical association, case-based reasoning, neural networks, rule induction, Bayesian belief networks, advanced algorithms, fuzzy logic, and rough sets theory can produce predictive models that can be relatively accurate, their outputs are not stated in terms familiar to most scientists, especially those from outside the discipline, and thus typically are not easy to communicate very effectively (Mitchell, 1997).
  • Integrating old and new knowledge sources. Another challenge is that SKD tools, methods, and techniques typically focus on discovering new knowledge and thus can make it difficult to incorporate scientists' existing knowledge about one or multiple domains. Partnerships and collaborations in research activities can lead to better utilization of the collaborators’ infrastructure and equipment. In particular, improved communication at the beginning of research projects between scientists and engineers can help eliminate these, and other, barriers at the outset.
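The format and semantic incompatibilities noted in the items above can be made concrete with a small sketch. It assumes two hypothetical sources that record the same quantity: one as CSV with a temp_c column in degrees Celsius, the other as JSON with a temperature_K field in kelvin and a differently named station field. The source contents, field names, and the common schema are all invented for illustration.

```python
import csv
import io
import json

# Two hypothetical sources describing the same kind of measurement.
CSV_SOURCE = "station,temp_c\nalpha,20.0\nbeta,25.5\n"
JSON_SOURCE = '[{"site_name": "gamma", "temperature_K": 300.15}]'


def from_csv(text):
    """Map the CSV source onto the common schema (station, temperature_c)."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"station": row["station"], "temperature_c": float(row["temp_c"])}
        for row in rows
    ]


def from_json(text):
    """Map the JSON source onto the common schema, converting kelvin to Celsius."""
    return [
        {
            "station": item["site_name"],
            # Rounded to two decimals to absorb floating-point noise.
            "temperature_c": round(item["temperature_K"] - 273.15, 2),
        }
        for item in json.loads(text)
    ]


# Only after both mappings can the records be analyzed together.
combined = from_csv(CSV_SOURCE) + from_json(JSON_SOURCE)
for record in combined:
    print(record)
```

Each additional source needs its own mapping of both format (CSV versus JSON) and meaning (field names and units), which is the extra effort the text describes when databases are not designed with later automated analysis in mind.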
Some questions concerning technological opportunities and challenges in computational SKD. What is the state-of-the-art, what are the likely future directions in SKD technologies and techniques, and which of these present the most promising areas of research? What actions can be taken within and especially across the disciplines, scientific institutions, and funding agencies to identify the opportunities for more effective interaction and coordination with the developers of computer-mediated SKD technologies and related standards?
Institutional and organizational factors. The rise of digital networks has not only created quantitatively and qualitatively new opportunities for computer-mediated SKD, but has stimulated the development of new institutional and organizational models designed to take greater advantage of the cyberinfrastructure and SKD capabilities.
Ø                  New institutional and organizational models for research in the virtual environment. As discussed above, new models of collaborative, open online science have emerged for producing and disseminating scientific data, literature, and other information products. For the production of new knowledge, these have included open source software, virtual laboratories and observatories, distributed grid computing, and collaborative Web sites and wikis. For the dissemination and diffusion of data and information, there have been open digital data centers and active archives, open federated data networks, open access journals, open institutional and thematic repositories for the literature, and free university curricula online. When integrated together online, these open knowledge resources are forming incipient information “commons” and knowledge environments (NRC, 2003 and Uhlir and Schröder, 2007). These new institutional and organizational arrangements are being formed to derive more value from the public investments in digital research networks and knowledge resources. Of particular interest to this proposed project, such mechanisms can enable more efficient and effective applications of digital SKD tools and techniques, particularly across disciplines and domains of knowledge. They also have involved the formation and adoption of new economic, legal, and social arrangements and conditions for research. Yet little is known empirically about these new institutional and organizational models, or how the pre-existing and still dominant models based on the former print paradigm compare in terms of their ability to promote digital SKD applications and the communication of their results.
Ø                  Improving innovation processes. More broadly, studies of innovation systems and performance typically emphasize that a nation’s capacity for innovation depends in part on the integration of its industry with its publicly funded scientific infrastructure, as well as on the effective exchange of knowledge, ideas, and practices among the three main institutional sectors that support and conduct research: government, universities, and industry (e.g., Spencer, 2000 and Lodge, 1990). These key institutional sectors need to be fully cognizant of the potential benefits of computer-mediated SKD and diffusion processes, and have the requisite capabilities (e.g., technical, human and financial resources) and the willingness (e.g., plans and activities) to coordinate and promote these processes. Attention at the institutional level by the various research sectors can greatly improve the production and flow of scientific knowledge within and outside science, and consequently influence the nation’s innovation performance and scientific competitiveness at the global level.
Some questions concerning institutional and organizational factors in computational SKD. What are the most promising institutional and organizational models across scientific disciplines for taking full advantage of digital SKD tools and processes? Conversely, what are the key barriers at the institutional and organizational level, and how can they be overcome most effectively? What are the costs and benefits of the different models in the computational SKD research context? And what are the main barriers and some of the most successful models of communication and collaboration for computer-mediated SKD among the government, university, and industry sectors, and what needs to be known to improve the effectiveness of such interactions?
Policy and legal factors. The laws and policies that are developed and institutionalized by the government concerning S&T research activities of course can have a significant impact on the ways in which research is developed, organized, and institutionalized, and determine the scope of potential spill-over to other fields. The extent to which access to and use of digital information is regulated in both the public and private sectors is especially important to creating incentives and disincentives for computer-mediated SKD and for radically speeding up the discovery-communication-discovery cycle in the open online environment.
Ø                  Policies and laws that promote or hinder computational SKD. At its core, the formulation and implementation of information law and policy reflect a balance between openness and restrictions. On the one hand, openness not only generally facilitates the exchange and use of scientific information, but also greatly simplifies the use of automated SKD techniques, especially across various boundaries (disciplinary, institutional, sectoral, and national). On the other hand, there are many valid reasons for imposing restrictions that frequently come into play, including proprietary rights in the information products and the knowledge derived, secrecy requirements based on national security concerns, the protection of personal privacy, and the need to implement technical security measures that protect these values and the integrity of information systems (NRC, 2003). Some research funders also may place restrictions in contracts or grants on the implementation and management of the research they support, which can affect its productivity and impact. Such restrictions include limits on the flexibility of scientists to modify their research goals and approaches, the freedom to pursue unexpected paths and high-risk research questions, the freedom to publish, the involvement of students and postdoctoral fellows in the research, and the nationality of researchers (NRC, 2005a and NRC, 2005b). The capabilities of computational SKD tools and technologies and the potential network effects of their application online can greatly magnify the benefits they generate, but those capabilities can be circumscribed or blocked entirely by such restrictions and security measures.
Ø                  Restrictive and permissive technologies and contracts. The respective rights of the producers, distributors, and users of proprietary digital information can now be mediated on a customizable basis with greater flexibility through the use of private contracts and technology protection measures (TPMs, also referred to as digital rights management [DRM] tools). Contracts and TPMs can be used to provide either more restrictive or more permissive terms of access and use than the terms established by public statutory law, allowing for greater flexibility in the application of SKD tools in the online environment, for example in mining data with hetero-sensitivities. Common-use licenses, such as those developed by the Creative Commons and Science Commons, can play an especially important role in effectively enabling computational SKD approaches that transcend the various boundaries noted above and facilitate complex problem-solving.
Some questions regarding the effects of policies, laws, and rights management tools on computational SKD. How is the application of computer-mediated SKD affected by different regulatory information regimes and how can private contracts, such as common-use licenses, and TPM/DRM controls be used to improve the effectiveness of digital SKD applications for complex, multi-disciplinary research and problem solving? What are the tradeoffs and their direct and indirect effects? What kinds of policy analysis and research are needed to achieve a better understanding of the tradeoffs and their effects? With regard to federal agency contracts and grants for digital SKD research, what are the principal barriers to progress in SKD research and applications and the principal limitations on complex, multi-disciplinary research collaborations, and what needs to be better understood about them?
Socio-cultural factors.  There are many socio-cultural issues raised by the human-machine interface, and more broadly in the entire SKD process. In some cases, the technological tools and capabilities will outpace the ability of researchers (much less non-experts) to take full advantage of them. There also are socio-cultural barriers to adopting open models for data and information management online. There is thus an inherent lag in the effectiveness of human systems in adapting to new technologies, much less using them easily. Different disciplines and types of research in different organizational settings respond to technological change at varying rates and can involve diverse social and cultural barriers. An improved understanding of the socio-technical and socio-cultural dimensions of computational SKD technologies can be indispensable for the effective and efficient development, deployment, and adoption of this emerging research paradigm, particularly for complex applications that cross various boundaries.
Some questions about overcoming social and cultural barriers to the more effective application of computational SKD in research. Questions that could be addressed at a deeper level may include: What are the social and cultural barriers that affect the successful research and applications of computer-mediated SKD and its enabling infrastructures? What are the social and cultural barriers that affect the successful deployment of automated SKD tools in the context of complex, interdisciplinary research? What research would shed light on and help address these barriers?
*          *          *
There are many questions, in addition to those posed initially above, that need to be answered empirically, or at least with a more complete understanding of available options, to support better informed decisions. As a general matter, there is a lack of data about what does or does not work in computer-mediated SKD and about the enabling online conditions. The metrics and methodologies used are rudimentary or still not defined, and there is insufficient research targeted at illuminating the problem areas and understanding them. An in-depth discussion of the status of evaluation methods and of the research on them with a view to improving our understanding of SKD processes, therefore, is both necessary and timely.
The focus will be primarily on identifying issues and mechanisms that cut across all scientific disciplines, with particular attention to those areas that are of greatest interest to the sponsors of the project where the discovery of knowledge can benefit from the integration of the openly available results and expertise from other disciplines. Examples and case studies will be selected by the project steering committee in consultation with the sponsors to help illustrate the issues raised above and to provide a more effective approach for the discussion. The outcome of this project will provide a synthesis of views by many of the leading U.S. experts in these areas to develop a range of research options focused on promoting the benefits and on reducing the barriers to computational SKD outlined above.
Plan of Action
Statement of Task
A symposium and workshop will be convened at the National Academies to bring together the key stakeholders for intensive and structured discussion in order to obtain a better understanding of the state-of-the-art and future trends in the study of computer-mediated scientific knowledge discovery and to develop a research agenda for future work in this area. Specifically, the project will be performed pursuant to the following statement of task:
1. Opportunities and Benefits of SKD: What are the opportunities over the next 5-10 years associated with the use of computer-mediated scientific knowledge discovery (SKD) across disciplines in the open online environment? What are the potential benefits to science and society of SKD?  
2. Techniques and Methods for Development and Study of SKD: What are the techniques and methods used in government, academia, and industry to study and understand these processes, the validity and reliability of their results, and their impact inside and outside science?
3. Barriers to SKD: What are the major scientific, technological, institutional, sociological, and policy barriers to computer-mediated SKD in the open online environment within the scientific community? What needs to be known and studied about each of these barriers to help achieve the opportunities for interdisciplinary science and complex problem solving?
4. Range of Options:  Based on the results obtained in response to items 1-3 above, define a range of options that can be used by the sponsors of the project, as well as other similar organizations, to obtain and promote a better understanding of the computer-mediated scientific knowledge discovery processes and mechanisms for openly available data and information online across the scientific domains. The objective of defining these options is to improve the activities of the sponsors (and other similar organizations) and the activities of researchers that they fund externally in this emerging research area. 
Project Work Plan
The project will be organized by an ad hoc steering committee of approximately eight individuals representative of the expertise required, including research policy, research information evaluation, information policy, information resources management, information technologies, computer science, information economics, and sociology of science, including ethics/protection of data privacy and confidentiality. Geographic representation and diversity of backgrounds also will be taken into consideration. Selection of the steering committee will be made through consultations with Academy members, National Research Council committees and staff, the sponsors of the project, external experts, and focused databases.
The steering committee will meet three times. The first meeting will take place several months in advance of the symposium and workshop to: meet with the project sponsors to obtain a better understanding of the sponsors’ interests in the project, plan the structure and management of the symposium and workshop, suggest the speakers and other expert invitees, agree on the elements of the two reports, and advise on all other aspects of the project plan.
The second meeting of the steering committee will be convened in conjunction with the symposium and workshop. The members of the steering committee will chair some of the sessions of the symposium as well as the subsequent workshop. The committee will meet immediately following the workshop to discuss the results of the two meetings and to agree on the outline of the report and the schedule for completion. The committee and project staff will hold additional consultations online and by conference call before and after the committee’s two meetings, both to prepare for the symposium and workshop, and to complete the resulting reports. 
The symposium will bring together leading scholars, practitioners, and other experts in government, academia, and industry who are directly involved in the research and applications of computer-mediated SKD and open data and information models to discuss the issues outlined in items 1-3 in the statement of task.
The symposium is expected to have approximately 200 attendees from government, academia, and industry, who work primarily in the areas of research administration, scientific information management and technology, and science policy. The symposium program also will be Webcast, making the discussion accessible to a national (and worldwide) audience, which also will enable the remote participants to submit questions and comments to the speakers by e-mail.
The workshop will involve approximately 40-50 experts, including the steering committee members, many of the symposium speakers, and some other selected experts who have been instrumental in organizing similar symposia or workshops or who have published extensively on these topics. A database of such experts has already been compiled and will be vetted and prioritized by the steering committee. The workshop will focus explicitly on developing a range of research options in computational SKD in the open online environment pursuant to task 4, taking into consideration the issues raised in tasks 1-3 that were presented and discussed during the symposium.
The symposium is expected to be organized according to the following general approach. Day one will be devoted entirely to plenary presentations and panel discussions by academics, government officials, and industry experts. The main focus of the presentations and discussions will be on issues in response to tasks 1 and 3. During the morning of the first day a questionnaire will be handed out to the audience attending the symposium and made available online for the remote participants listening to the Webcast. This questionnaire will solicit input from both the physical and virtual audience in response to the questions raised in the statement of task, as well as to other related questions developed by the steering committee in advance. The audience will be requested to submit the responses to the questionnaire no later than lunch on the second day, so that the project staff can compile the input and use it for the discussion at the subsequent workshop.
The morning of the second day of the symposium will be devoted to another plenary session where the areas of barriers identified in the statement of task will be discussed extensively by the speakers and the audience. Part of the focus will be on identifying elements of a research agenda for better understanding and addressing the barriers, and also to help inform the discussion in the subsequent workshop. The concluding session of the symposium will be in the early afternoon of the second day and will concentrate on the criteria for determining success in overcoming barriers, as well as future trends and their associated opportunities and threats.
The workshop will start in the late afternoon of the second day in a plenary session in which the invited experts will be briefed on the expected outcome of the workshop and the objectives and methodology of the break-out sessions. A summary of the opportunities and challenges related to computer-mediated SKD based on the symposium discussion and the audience responses to the questionnaire will be presented and discussed to identify any important issues that may have been missed.
On the morning of the third day, the workshop experts will be divided into several (3-4) groups, organized according to the barriers that need to be addressed in developing a range of options for future research in computer-mediated SKD, in response to task 4. Each thematic breakout session will be facilitated by an expert moderator selected by the steering committee and the substance of the discussion will be summarized by a rapporteur.
Symposium and Workshop Products
There will be three published products from the meeting. The first will be an audio Webcast of the entire symposium proceedings, which will be archived on the National Academies’ Web site. The second will be an edited collection of the symposium presentations. This report from the symposium will be produced online only. The third product will be a rapporteur’s summary report from the workshop, which will be available in both print and online formats. Both reports will be published by National Academies Press.
Outreach and Communication Activities
Working with the sponsors of this project and consulting various information resources, the project staff will broadly publicize the symposium in advance in order to bring together a large audience of scholars and practitioners in this field from government, academia, and industry. A target audience of approximately 200 participants is envisioned. A variety of outlets will be used, including direct notices to relevant listservs, discussion forums, professional society networks, and the science press. The National Academies Web site and the Web sites of the project’s sponsors will also be used to publicize the event. Prior to the meeting, the National Academies’ Office of News and Public Information will notify journalists about the meeting to encourage their reporting on the symposium proceedings. The meeting also will be Webcast and information about that will be disseminated through the same outlets noted above.
The same process will be used to publicize the release of the final publications. In addition, National Academies Press will use its standard marketing techniques to publicize the reports.
Finally, members of the steering committee and the project staff will report on the results in public fora, such as professional society conferences and other meetings organized by government and academic institutions, and will use the results in planning potential follow-on projects.
Collaboration with Other Organizations
No formal collaborations with other organizations are planned. However, there will be many informal consultations with other knowledgeable groups, both within the National Academies and externally, that are involved in the scientific information sector. Within the National Academies, the project staff will consult with the various other boards and committees involved in data and information management activities and issues. In particular, the project will be coordinated closely with and build upon the results of another related project in this area, on “Overcoming the Technical and Policy Constraints that Limit Large-Scale Data Integration,” that is being organized by the Committee on Theoretical and Applied Statistics and the Government-University-Industry Research Roundtable.
With regard to external contacts, a comprehensive list of organizations, publications, meetings, and experts has been assembled already and will be expanded further, both for purposes of speaker invitations as well as for publicity and potential follow up. The project staff also will consult with the sponsors of the project in particular to obtain their ideas about issues to address, people to invite, and groups to contact. Other scientific information management organizations in government, academia, and industry will be consulted as well, including professional societies and organizations working in these areas.
The workshop summary report will be available to the public and widely disseminated, without restriction, including publication on the National Academies’ Web site, as discussed above. The workshop report will be prepared in sufficient quantity to ensure its distribution to the sponsors and other relevant parties, in accordance with National Academies policy. The symposium proceedings report and the Webcast of the symposium will be made publicly available on the National Academies’ Web site.
Public Information about the Project
In order to afford the public greater knowledge of Academy activities and an opportunity to provide comments on those activities, the Academy may post on its Web site the following information as appropriate under its procedures: (1) notices of meetings open to the public; (2) brief descriptions of projects; (3) Committee appointments, if any (including biographies of Committee members); (4) report information; and (5) any other pertinent information.
Estimate of Costs
The total estimate of program costs for the period beginning in October 2007 and ending in November 2008 is $266,500. The full budget is available separately.
Responsible Staff Officer
Paul F. Uhlir, J.D.
Director, International S&T Information Programs
Board on International Scientific Organizations
The National Academies
Washington, DC


Atkins, Daniel, et al. (2003). Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. Technical Report. Arlington, VA: National Science Foundation.
Hay, B. and K.L. Nance. (2004). IDACT: Automating Scientific Knowledge Discovery. In the proceedings of the Environmental Modeling and Simulation Conference. 22-24/11/2004. Virgin Islands, U.S.
Kostoff, Ronald N. (2005). “Systematic Acceleration of Radical Discovery and Innovation in Science and Technology”, Storming Media, available for a fee at:
Kurabayashi, S., N. Ishibashi, and Y. Kiyoki. (2002). A Multidatabase System Architecture for Integrating Heterogeneous Databases with Meta-level Active Rule Primitives. In the proceedings of Applied Informatics Conference. 18-21/2/2002. Innsbruck, Austria.
Lodge, G.C. (1990). Comparative Business-Government Relations. Englewood Cliffs: Prentice-Hall.
McKinney, Maureen. “The Wisdom of Grids”, Government Health and IT. June 4, 2007.
Mitchell, T. M. (1997). Machine Learning (New York, USA: McGraw Hill).
National Research Council (1995). Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation’s Scientific Information Resources. Washington, D.C.: National Academy Press.
National Research Council (1997). Bits of Power: Issues in Global Access to Scientific Data. Washington, D.C.: National Academy Press.
National Research Council (2003). The Role of Scientific and Technical Data and Information in the Public Domain. Washington, D.C.: National Academies Press.
National Research Council (2004). Electronic Scientific, Technical, and Medical Journal Publishing and Its Implications. Washington, D.C.: National Academies Press.
National Research Council (2005a). Engineering Research and America's Future: Meeting the Challenges of a Global Economy. Washington, D.C.: National Academies Press.
National Research Council (2005b). Assessment of Department of Defense Basic Research. Washington, D.C.: National Academies Press.
National Science Foundation (2006). From Cyberinfrastructure to Cyberdiscovery in Materials Science. Simon J.L. Billinge, Krishna Rajan, and Susan B. Sinnott, eds, Arlington, VA.
Reichman, J.H. and Paul F. Uhlir (2003). “A Contractually Reconstructed Scientific Data Commons in a Highly Protectionist Intellectual Property Environment,” Law & Contemporary Problems, 66(Winter/Spring): 315-461.
Spencer, Jennifer (2000). “Knowledge Flows in the Global Innovation System: Do U.S. Firms Share More Scientific Knowledge than their Japanese Rivals?” Journal of International Business Studies, 31(3): 521-561.
Szalay, Alexander, and Jim Gray. Science in an exponential world. Nature, Vol. 440:413. 23 March 2006.
Uhlir, Paul F. (2006), “The emerging role of open repositories for the scientific literature as a fundamental component of the public research infrastructure”, in Open Access: Open Problems, G. Sica, ed., Polimetrica, Monza, Italy.
Uhlir, Paul F. and Peter Schröder (2007). “Open Data for Global Science.” Data Science Journal, CODATA, Paris.

[1] We gratefully acknowledge the generous support of the Office of Scientific and Technical Information at the Department of Energy for developing this proposal and related background research.