The National Academies of Sciences, Engineering, and Medicine
Board on Mathematical Sciences and Analytics

Symposium on Statistical Issues in Data Acquisition

Lawrence Berkeley National Laboratory
July 16, 2004


8:00-8:30 a.m.
Continental Breakfast

8:30-8:45 a.m.
Welcome and Overview of Sessions

Bob Jacobsen
Lawrence Berkeley National Laboratory

Karen Kafadar
University of Colorado at Denver
Committee on Applied and Theoretical Statistics (CATS)

Session One: Capturing & Analyzing Data in Fundamental Physics

8:50-9:10 a.m.
Introduction to High Energy Physics (HEP) Data
Bob Jacobsen, Lawrence Berkeley National Laboratory

9:10-9:25 a.m.
The Future of HEP Data at the Large Hadron Collider
Paolo Calafiura, Lawrence Berkeley National Laboratory

9:25-10:00 a.m.
Statistical Issues in Cosmology
Julian Borrill, Lawrence Berkeley National Laboratory

10:00-10:45 a.m.

10:45-11:00 a.m.


Facilities Tours and Discussion

11:00 a.m.-12:30 p.m.
Tours (in parallel)

Advanced Light Source Facility (ALS)
National Energy Research Scientific Computing Center (NERSC)

12:30-1:30 p.m.


Session Two: Data Acquisition & Analysis in the Earth Sciences

1:30-1:50 p.m.
General Architecture of EOSDIS Data Capture and Storage
Amy Braverman, Jet Propulsion Laboratory

1:50-2:10 p.m.
Issues Arising from Data Pre-Processing Prior to Analysis
Wendy Meiring, University of California at Santa Barbara

2:10-3:15 p.m.

3:15-3:30 p.m.


Session Three: High Performance Computing

3:30-3:50 p.m.
Long-Running Simulations on High Performance Computers
George Ostrouchov, Oak Ridge National Laboratory

3:50-4:20 p.m.


Wrap-up Discussion

4:20-5:00 p.m.
Moderated by Ed Wegman
George Mason University
Committee on Applied and Theoretical Statistics (CATS)

6:00-7:30 p.m.


Statistical Issues in Data Acquisition: Workshop Summary

In today’s information age, scientists rely chiefly on statistical modeling and analysis to manage massive amounts of data. Data acquisition is one realm where these techniques can help scientists capture and parse data. Specifically, researchers often ask, “How can data be collected in order to discover unanticipated information?” To address this and other questions involving data acquisition, the Committee on Applied and Theoretical Statistics (CATS) of the National Research Council held a day-long workshop at Lawrence Berkeley National Laboratory (LBL) on July 16, 2004. Statisticians and scientists from fields such as high-energy physics, earth science, and high-performance computing discussed statistical techniques and methodologies from their research and highlighted current problems and solutions.

Robert Jacobsen, a high-energy physicist at LBL, opened the workshop with a presentation of the standard model of particle physics. In this model, physical events are constructed by chains of particle interactions, and these chains are reconstructed backwards through statistical inference. However, Dr. Jacobsen stated that most physicists are not trained in uncertainty analysis, and therefore “observations that fit don’t always make the cut because throwing them out lowers uncertainty”.

A number of statistical challenges arise because of a lack of distinction between “data collection” and “data processing”. Julian Borrill, an LBL cosmologist, explained that while many large data-collecting organizations (such as NASA and NSF’s ground-based facilities) publish their data for scientific use, there is generally no specification of what, if any, refinements have been made to the raw data. On one hand, scientists may unintentionally use data that have been biased by certain types of pre-processing; on the other hand, scientists may find that without any processing, data sets can be too large or noisy to be useful. Wendy Meiring, a statistician at the University of California at Santa Barbara, echoed a similar concern. In her work with ozone data, there is debate over revising and re-releasing data to correct for deficiencies in the data collection instruments that were discovered years later.

Many workshop participants expressed the need for data owners to work with statisticians during the data gathering process. Amy Braverman, often the only statistician among her collaborators at NASA’s Jet Propulsion Laboratory, suggested that within the simulation and modeling community there is a common misconception that statisticians work entirely on error estimates, and that they are not seen as experts in variation. One workshop participant commented that a reason for this lack of communication is that statisticians tend to define themselves by methodology (Bayesian, etc.) as opposed to data type (climate data, for example). Others agreed that if statisticians organized themselves according to application area, it would help to bridge the gap between the statistics and non-statistics communities.

Two workshop attendees discussed current high-performance computing tools available for data acquisition. George Ostrouchov, a computer scientist at Oak Ridge National Laboratory, gave an overview of DOE’s Scientific Discovery through Advanced Computing (SciDAC) program. This program aims to develop the software and hardware needed for terascale computers to run PDE-based finite element simulations. Jogesh Babu, a statistician at Penn State University, introduced the workshop participants to VOStat, a web-based service providing a suite of statistical tools designed for astronomers working with large data sets. The VOStat project is a joint effort led by Penn State University in collaboration with Carnegie Mellon University and the California Institute of Technology.

Overall, the participants found the workshop discussions useful, and a number of them discussed the possibility of future collaborations.


Workshop Participants

AMY BRAVERMAN, Jet Propulsion Laboratory
JOGESH BABU, Pennsylvania State University
PETER BICKEL, University of California at Berkeley
PAOLO CALAFIURA, Lawrence Berkeley National Laboratory
MARK FITZGERALD, University of Colorado at Denver
ROBERT JACOBSEN, Lawrence Berkeley National Laboratory
KAREN KAFADAR, University of Colorado at Denver
WENDY MEIRING, University of California at Santa Barbara
STEVE SAIN, University of Colorado at Denver
JULIAN BORRILL, Lawrence Berkeley National Laboratory
GEORGE OSTROUCHOV, Oak Ridge National Laboratory
ED WEGMAN, George Mason University