Overcoming Barriers to Integrating Heterogeneous Data:
Challenges and Opportunities for Research Communities
Scope and Goals for the National Academies Workshop
Full Project Proposal
Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop (2010)
The National Academies are organizing a cross-disciplinary public workshop to explore alternative visions for achieving large-scale data integration. The workshop will be held August 19-20, 2009 (Wednesday-Thursday) in Washington, DC. The workshop is a joint activity of the Board on Mathematical Sciences and Their Applications and the Policy and Global Affairs Division, with the original impetus coming from discussions of the Government-University-Industry Research Roundtable. This outline of the scope and goals of the workshop has been prepared to facilitate planning and communication with workshop participants.
Focus on the data integration challenges of scientific, engineering, and medical research communities.
Data integration issues are often discipline-specific. For example, the 2005 report Catalyzing Inquiry at the Interface of Biology and Computing provides a glimpse of the promise and challenges of data integration in biology, and some efforts to effect data integration. That report gives a broad overview of the interface between computing and biology, with a focus on open challenges. In contrast, our project will focus more on technologies that are being applied in different fields, so as to compare them and share perspectives, and on examining the policy issues that must be addressed in order to enable effective application of current and emerging technologies.
In order to bound the discussion and produce the most useful outcomes, the workshop will focus on issues related to integrating scientific research data. The communities likely to be covered include physics, biology, chemistry, earth sciences, satellite imagery, astronomy, geospatial data, and medical research data. By and large, this is all “structured data” (i.e., records of fairly rigidly formatted information). In contrast, many data integration efforts outside of research deal with unstructured data (text) and semi-structured data (e.g., want ads and personnel records). Unstructured data will be relevant only to the extent that they are central to one or more of the target communities. The needs of the intelligence community, health care, and most commercial sectors will not be a focus.
The workshop will examine a collection of scientific research domains, with an application expert explaining the issues in their discipline and current best practices. This approach will allow the participants to gain insights about both commonalities and differences in the data integration challenges facing various communities.
Satellite imagery provides a useful example of community-specific issues. The current state of the art is to transform raw imagery into “cooked” imagery through a pipeline of cleaning and transformation steps that produce standard data products. Each individual researcher (and there are thousands) then wishes to process this cooked data in his or her preferred coordinate system, of which there are many, in order to construct and share derived data sets containing researcher-specific data elements. In addition, researchers often want to “recook” the raw data using other algorithms. Hence, an absolute requirement in this domain is to record the “provenance” of the data.
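The provenance requirement described above can be made concrete with a minimal sketch. The sketch below is illustrative only (the class, the step names, and the scene identifiers are all hypothetical, not drawn from any actual imagery system): each data product is content-addressed, and each derived product records the inputs and the processing step that produced it, so that the full chain back to the raw downlink can be reconstructed.

```python
import hashlib

def product_id(payload: bytes) -> str:
    """Content-address a data product so any copy can be verified."""
    return hashlib.sha256(payload).hexdigest()[:12]

class ProvenanceLog:
    """Records, for each product, the inputs and the processing step
    (e.g., a standard pipeline, a recooking algorithm, or a coordinate
    transform) that produced it."""

    def __init__(self):
        self.records = {}

    def register(self, payload: bytes, step: str, inputs: list) -> str:
        pid = product_id(payload)
        self.records[pid] = {"step": step, "inputs": inputs}
        return pid

    def lineage(self, pid: str) -> list:
        """Walk back toward the raw data, listing every processing step."""
        steps = []
        for parent in self.records.get(pid, {}).get("inputs", []):
            steps.extend(self.lineage(parent))
        if pid in self.records:
            steps.append(self.records[pid]["step"])
        return steps

# Hypothetical lifecycle of one scene: raw -> cooked -> reprojected.
log = ProvenanceLog()
raw = log.register(b"raw scene 42", "downlink", [])
cooked = log.register(b"cooked scene 42", "standard-pipeline-v3", [raw])
derived = log.register(b"reprojected scene 42", "reproject:sinusoidal", [cooked])
print(log.lineage(derived))
# -> ['downlink', 'standard-pipeline-v3', 'reproject:sinusoidal']
```

A researcher receiving the derived product can then see exactly which pipeline version cooked it and which reprojection was applied, which is what makes a later “recook” with a different algorithm reproducible.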
Although coordinate transformation packages are available, sharing is most often carried out through ad hoc arrangements between researchers and is accomplished with a great deal of gnashing of teeth and custom one-off code. Relying on such ad hoc arrangements is inherently inefficient and wasteful, limits the utility of research data, and slows progress in this particular domain. Are there new approaches that might reduce or eliminate these disadvantages in this particular field? Would they have benefits elsewhere in the research enterprise?
In the life sciences, there has been an exponential increase in the volume and heterogeneity of data (e.g., sequenced complete genomes, 3D structures, DNA chips, mass spectrometry data), much of it made available over the Web. Data integration advances are needed to combine, process, and analyze these data resources in order to uncover patterns and advance discovery.
High energy physics is characterized by enormous experimental facilities whose development exceeds the financial capabilities of individual countries, generating a deluge of non-reproducible data. The ability to integrate and reuse data could advance the field in a number of areas, but the community faces significant barriers due to the amount and complexity of the data, as well as a lack of financial and academic incentives.
Focus on heterogeneous data.
The main focus of this workshop will be on the integration of heterogeneous data. It may be desirable to integrate thousands of data sources; hence “scale” in this context will primarily refer to the number of data sources. Issues related to large (e.g., petabyte-plus) data sets utilized by the participating communities will also be discussed, such as the robustness of algorithms used to process image data, and the implications for the metadata necessary for data exchange and integration.
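When “scale” means the number of data sources rather than raw volume, the core integration task is mapping each source's schema into a shared target schema. A minimal sketch, with entirely hypothetical source and field names: each new source is absorbed by adding one mapping entry, so the integration effort grows with the number of sources, not with the size of any one data set.

```python
# Common target schema for the integrated collection (hypothetical).
TARGET_FIELDS = ("object_id", "ra_deg", "dec_deg")

# Per-source metadata: how each source's field names map to the
# target schema. Adding a source means adding one entry here.
SOURCE_MAPPINGS = {
    "survey_a": {"object_id": "id", "ra_deg": "ra", "dec_deg": "dec"},
    "survey_b": {"object_id": "name", "ra_deg": "alpha", "dec_deg": "delta"},
}

def integrate(source: str, record: dict) -> dict:
    """Translate one source record into the common schema."""
    mapping = SOURCE_MAPPINGS[source]
    return {field: record[mapping[field]] for field in TARGET_FIELDS}

rows = [
    integrate("survey_a", {"id": "J1234", "ra": 187.5, "dec": 12.1}),
    integrate("survey_b", {"name": "J1234", "alpha": 187.5, "delta": 12.1}),
]
print(rows)
# Both records now look identical under the target schema.
```

The per-source mapping table is exactly the kind of metadata whose standardization and exchange the workshop discussions address: without it, each pair of sources needs custom one-off code.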
Overcoming technical and policy barriers.
In addition to research domain experts, the workshop will also feature experts working on the cutting edge of techniques for handling data integration problems. This will provide participants with insights on the current technological state of the art.
We anticipate that this discussion will identify several areas in which the emerging needs of research communities are not being addressed, and therefore point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.
The workshop will then proceed to discuss possible approaches to facilitating data integration. The meeting and resulting summary will consider the pros and cons of various “ways forward,” including:
a) Promotion of open and flexible standards. How should various stakeholders work to develop data and metadata standards? Are there domains where a “top-down” approach by sponsoring agencies or other entities would be promising? Are there incentives that could be created to encourage the development and widespread utilization of such standards?
b) Metadata repositories (registries). Where have metadata repositories made a contribution? What is the potential for this approach in the research communities represented at the workshop? What can be done to facilitate the creation and use of these repositories?
c) Federated data bases. Are there fields in which federated databases are a potentially useful approach?
d) Integrating measurements with different space/time scales. How do we overcome differences in vocabularies—particularly for large-scale research problems that may cross disciplines?
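Item (d) combines two distinct problems: reconciling vocabularies and reconciling sampling scales. A minimal sketch of both, under stated assumptions (the variable names, the controlled-vocabulary term, and the sampling intervals are all hypothetical): a shared vocabulary maps each community's local name to a common term, and resampling onto a common time grid makes differently sampled measurements comparable.

```python
# Hypothetical controlled vocabulary: two communities' local names
# for the same measured quantity map to one shared term.
VOCAB = {
    "temp_c": "air_temperature_celsius",
    "t_air": "air_temperature_celsius",
}

def to_common_grid(series, step):
    """Average (time_sec, value) samples into bins of width `step` seconds,
    keyed by the start time of each bin."""
    bins = {}
    for t, v in series:
        bins.setdefault(t // step, []).append(v)
    return {b * step: sum(vs) / len(vs) for b, vs in sorted(bins.items())}

# Instrument A samples every 10 s; instrument B every 30 s.
a = [(0, 20.0), (10, 21.0), (20, 22.0), (30, 23.0), (40, 24.0), (50, 25.0)]
b = [(0, 20.5), (30, 23.5)]

grid_a = to_common_grid(a, 30)   # {0: 21.0, 30: 24.0}
grid_b = to_common_grid(b, 30)   # {0: 20.5, 30: 23.5}

# Same variable under both local names, now on the same time grid.
assert VOCAB["temp_c"] == VOCAB["t_air"]
```

The hard part in practice is not the resampling arithmetic but agreeing on the vocabulary table and on which common grid is scientifically defensible, which is why the question above is framed as a community and policy issue rather than a purely technical one.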
The workshop will also discuss policy barriers to widespread data sharing. The most obvious is protecting privacy, and we expect a frank discussion of what can be done to improve future systems in this area. The workshop will also surface policy issues that could arise from advances in data integration technologies—e.g., tightening of restrictions on raw data because of fears that privacy could be compromised if it were later integrated with other data—and illustrate how researchers might forestall policies that are over-constraining.