Complex Data Structures in the Health, Social and Environmental Sciences (07w5067)

Arriving in Banff, Alberta Sunday, April 8 and departing Friday April 13, 2007


(Duke University)

(University of California, Berkeley)

Christian Léger (Université de Montreal)

(McGill University)

Jamie Stafford (University of Toronto)

Henry Wynn (London School of Economics & Political Science)



Quantitative tools are an essential component of research in environmental, health and social sciences where the nature of the data that can now be obtained in these fields means that well established and standard research methods and statistical summaries are not adequate. The result has been an explosion of interest in these sciences in obtaining statistical expertise to tackle data analysis problems of substantial complexity, and in training health and environmental researchers in the concepts and methods of advanced statistical analysis. All statisticians and biostatisticians in Canada are familiar with the nearly endless opportunities for collaborative research in these and other areas of science. The dramatic change that is taking place now through NPCDS is the emergence of networks of statistical scientists working on large-scale projects with researchers in the environmental, health and social sciences. This moves well beyond the model of interdisciplinary research at a single university, to a model of national efforts brought to bear on problems of national importance. The NPCDS has been an important catalyst in this transformation. It has altered the cultural environment of Canada's statistical sciences by providing expertise, encouragement, and funding for approaching research collaborations in a national context and by developing international links with statistical research institutes in the US and the EU. This workshop will be a pivotal event in a national effort to build interdisciplinary research capacity with a strong statistical component. Projects within NPCDS will meet to report novel scientific findings, exchange ideas, identify and explore overlap, summarize activity, engage in scientific collaboration, and engage in strategic and visionary planning. The event is timely and will have a profound impact on the future course of NPCDS.

Networking of statistical scientists in collaborations with researchers in environmental, health and social sciences is an international trend. The Statistical and Applied Mathematical Sciences Institute (SAMSI) is a national institute in the United States whose vision is to forge a new synthesis of the statistical sciences and the applied mathematical sciences with disciplinary science to confront the very hardest and most important data- and model-driven scientific challenges. The director of SAMSI, Jim Berger, is a member of the board of directors of NPCDS. EURANDOM is a European research institute for the study of stochastic phenomena, in areas such as genomics and proteomics, polymer physics, risk assessment, image processing, communication networks, reliability, logistics and data mining. The scientific co-director of EURANDOM, Henry Wynn, has recently joined the board of directors of NPCDS. The involvement of SAMSI in NPCDS’s activities has had a profound impact on the program and subsequently on the statistical sciences community at large. The mutually beneficial relationship between NPCDS and SAMSI has raised the profile and influence of the program to unexpected levels. It is now common for NPCDS projects to have a significant profile in SAMSI’s thematic programs. Indeed, NPCDS can serve as a platform for piloting SAMSI activity. While relations with EURANDOM have only recently been formalized, we envision similar cooperative efforts. The international connections forged this way by NPCDS permit research advances in Europe and the United States to quickly impact interdisciplinary research involving quantitative methods here in Canada, and they give NPCDS and Canada's statistical scientists another voice in the international arena. Leading representatives of both SAMSI and EURANDOM will be present at this event.

Scientific Agenda

Many NPCDS projects focus on themes that either complement or are central to the health and environmental sciences. What follows is a very brief description of the varied research themes that make up the scientific agenda of this event. Underlying these is a common structure, both in statistical methods and in data analytic challenges, that will unify participants, encourage cross-fertilization and accelerate the research agenda of NPCDS.

Disease Mapping, Surveillance and Global Health:

Understanding why diseases occur is an important step in combatting them, and disease mapping is a powerful data analytic tool for surveillance and public health research. This new discipline emerges from the recent rapid growth of statistical methods to address data collected in space and time, the possibility for seamless integration of statistical methods in Geographic Information Systems and spatial visualization tools, and rapid information transfer on health outcomes, including new developments of historical databases. Space-time visualization is extremely useful in a broad range of health-related activities, for example monitoring vaccine- and drug-associated adverse events; monitoring diseases that are transmissible between animals and humans; tracking and flagging potentially dangerous trends and patterns, such as an increase in sales of specific over-the-counter drugs (for example, for the treatment of diarrhea); and identifying hotspots of activity, such as those of sexually transmitted disease. Health geomatics is still in its early stages of development, perhaps especially in Canada, and capacity building in this area will be an important future theme within NPCDS.


The Canadian Consortium on Statistical Genomics is a unique project of NPCDS which was formed in order to maximize the statistical impact on the ongoing genomic revolution. While its primary goals are methodological research and training, its mission is to be guided by the science behind the explosion of genomic data. This project has chosen a collaborative model of operation where research questions arising in the context of a large and complex genomic study motivate the research and guide the training of highly qualified personnel (HQP). The Assessment of Risk for Colorectal Tumours in Canada (ARCTIC) is a large and unique 12M dollar clinical genomic project involving the newest high-throughput genomic platforms (including the 500,000 Affymetrix SNP array), complex epidemiological and clinical subject-level data, and a multitude of challenging research questions that require innovative statistical work and close collaborative research effort. Working closely with the ARCTIC statistical team, the consortium is also using the project and its unique data as a platform for research and post-graduate training. The project currently has a postdoctoral fellow and expects to engage 2-4 graduate students in ARCTIC-motivated research.

Data Mining and Drug Discovery:

Data mining is the science of discovering useful, novel, statistically significant patterns in large databases. Research needs in data mining revolve around the creation of new modelling techniques. For example, in drug discovery, statistical models can be used to predict the activity of compounds against biological targets, such as the AIDS virus. By taking as inputs the numeric descriptors of the molecular structure of a compound, these models can “virtually screen” large libraries of compounds. In addition to solving a specific problem of great relevance to the health sciences, members of this team have developed new statistical methodology for a broad class of problems in which a few rare objects, e.g., active compounds, must be identified from a large group of irrelevant ones.

Surveys, Health and Social Policy:

The advancement of health and social policy requires evidence, an important component of which is provided by data from medium and large scale surveys. The majority of national health and social surveys in Canada are carried out by Statistics Canada, including the General Social Survey, the major longitudinal surveys such as the National Population Health Survey and the National Longitudinal Survey of Children and Youth (NLSCY), the cross-sectional Canadian Community Health Survey, and the forthcoming Canadian Health Measures Survey. Health Canada, provincial government organizations, private sector organizations and international consortia are also funding or conducting surveys related to health status, health behaviour, environmental issues and social welfare at an unprecedented rate. At the same time, requirements for the safeguarding, maintenance, and organization of the resulting masses of data have led to new concepts in data archiving, exemplified by the Institute for Clinical Evaluative Sciences.

The potential of these surveys and their data for advancing social policies is fulfilled when the surveys are designed and the data are analyzed by closely collaborating teams of health and social scientists, statisticians, and computer scientists. Funding for such collaboration is beginning to be provided by programs such as the Collaborative Health Research Projects program of CIHR. Perhaps the most important factor in bringing health and social scientists and statistical scientists together in recent years has been the Research Data Centres (RDCs) program of Statistics Canada, assisted by SSHRC and the CFI, which began in 2000 and has established an ideal set of channels for networking at the “grass roots” level.

Waiting Times and Quality of Care: Performance Evaluation of Public Health Systems:

Under the “Plan for Reporting Comparable Health Indicators in November 2004” by the Advisory Committee on Governance and Accountability (2004), a total of 70 performance indicators (PIs) are expected to be regularly reported by all jurisdictions in Canada to assess the current state of our health care system, primarily in the areas of health status, health outcomes, and quality of service. In addition, the Health Indicators Framework, led by the Canadian Institute for Health Information (CIHI, 2004) and Statistics Canada, contains several dozen other performance indicators within the four domains of Health Status, Non-Medical Determinants of Health, Health System Performance, and Community and Health System Characteristics.

The collecting of a large number of possibly related performance indicators from across Canada, while likely essential for identifying specific causes of sub-optimal performance within particular settings, will complicate the task of prioritizing indicators for purposes such as resource allocation. When these indicators are highly correlated, it may be helpful to represent the state of a health care setting by a relatively small set of composite indicators. New research is also needed on the monitoring of performance indicators over time, an area which falls under the domain of statistical quality control theory.
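As an illustration only, one common way to build a composite indicator from correlated indicators is to take the first principal component of their correlation matrix. The sketch below is a minimal standard-library version of that idea and is not a method prescribed by the project; the function names, the simulated data in `simulate_indicators`, and the use of power iteration are all assumptions made for the example.

```python
import math
import random


def standardize(rows):
    """Centre each indicator (column) and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(corr, iterations=500):
    """Leading eigenvector and eigenvalue by power iteration.

    Starts from a uniform vector, which works when the indicators are
    mostly positively correlated (an assumption of this sketch).
    """
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return v, eigenvalue


def composite_scores(rows):
    """Return (weights, one composite score per setting, variance share)."""
    z = standardize(rows)
    corr = correlation_matrix(z)
    weights, eigenvalue = leading_component(corr)
    scores = [sum(w * x for w, x in zip(weights, r)) for r in z]
    return weights, scores, eigenvalue / len(corr)


def simulate_indicators(n_settings, rng):
    """Hypothetical data: three indicators driven by one latent factor,
    plus a fourth, unrelated indicator."""
    rows = []
    for _ in range(n_settings):
        latent = rng.gauss(0.0, 1.0)
        rows.append([latent + 0.3 * rng.gauss(0.0, 1.0) for _ in range(3)]
                    + [rng.gauss(0.0, 1.0)])
    return rows
```

Calling `composite_scores(simulate_indicators(50, random.Random(1)))` returns the indicator weights, one composite score per simulated setting, and the share of total variance carried by the composite. In practice the choice of weighting scheme is itself a research question, so this should be read as one possible starting point.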

Lifelong Health Initiative:

CIHR Institutes, including the Institutes of Population and Public Health, Aging, Human Development, Child and Youth Health, and Genetics, are currently engaged in developing longitudinal cohort studies of the Canadian population under the rubric of the Canadian Lifelong Health Initiative. This initiative encompasses the Canadian Longitudinal Study on Aging and the proposed Canadian National Birth Cohort. Large population-based studies of this sort pose challenging statistical problems due to their complex sampling designs and the need to make inferences regarding both the full population and specific sub-populations. Moreover, responses relate to complex evolving health outcomes as well as a number of other determinants of health. The development of powerful, flexible statistical models is required to understand and interpret the complex nature of associations among these factors. Critical statistical issues include the development of efficient design and analysis strategies to meet scientific objectives involving genome-wide measures of sequence variation and gene expression, and complex correlation structures among repeated measures through time and familial associations. The success of the Lifelong Health Initiative and other studies of this sort depends critically on the development of rigorous design and analysis techniques.

Computer Experiments, Dynamic Treatment Regimes and Real Time Interventions During Outbreaks:

The design and analysis of experiments continue to make far-reaching contributions to scientific investigation. Historically, scientists have relied on physical experiments to help understand processes. Now the simulation of complex systems is feasible where physical experimentation is too costly, or where the process is impossible to observe. Computer models are frequently deterministic; if a computer trial is repeated with the same inputs to the simulator, the output will be identical. However, in many situations the dimensionality of the input to the computer code can be very large. In others, simulation of the complex phenomena can be computationally very expensive. In still other applications, the output of the simulator may be data that are a very complex function of the inputs. Common challenges in any of these situations are the related issues of experimental design and the analysis of the output. Choosing a good experimental plan has become more challenging in recent years, as evaluation of complex computer models has required ever more intricate analytical methods, particularly when the data arise from several sources. Applications considered by the research team include experiments for drug discovery and for dynamic treatment regimes, and simulation of epidemic outbreaks, in partnership with industry.
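To make the design question concrete, the following is a minimal sketch of a Latin hypercube design, a space-filling plan often used for deterministic simulators because replicating a run yields no new information. It is offered as an illustration and is not claimed to be the design used by the research team; `toy_simulator` is a hypothetical stand-in for an expensive computer model.

```python
import math
import random


def latin_hypercube(n_runs, n_inputs, rng):
    """Return n_runs points in [0, 1)^n_inputs.

    Each input range is cut into n_runs equal strata, and every stratum
    is used exactly once per input, in an independently shuffled order.
    """
    columns = []
    for _ in range(n_inputs):
        strata = list(range(n_runs))
        rng.shuffle(strata)
        # Place one point uniformly at random inside each stratum.
        columns.append([(s + rng.random()) / n_runs for s in strata])
    # Transpose so that each row is one run of the simulator.
    return [[columns[j][i] for j in range(n_inputs)] for i in range(n_runs)]


def toy_simulator(x):
    """Hypothetical deterministic computer model with two inputs."""
    return math.sin(2.0 * math.pi * x[0]) + x[1] ** 2


def run_experiment(n_runs, rng):
    """Evaluate the toy simulator once at each design point."""
    design = latin_hypercube(n_runs, 2, rng)
    return [(point, toy_simulator(point)) for point in design]
```

For example, `run_experiment(10, random.Random(0))` gives ten input/output pairs in which each tenth of each input range is sampled exactly once. With many inputs or a costly simulator, the number of runs and the criterion for spreading them out become the central design choices.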

Climate and Agriculture:

The focus of this study is the development of a 10 kilometer daily precipitation grid covering at least the areas of Canada supporting agriculture. Such a grid would be an essential tool in developing strategies for risk management for Canadian farmers. Precipitation data are notoriously difficult to analyze because of their complexity and specifically because they come in two main forms with rather different temporal and spatial distributions, namely sudden storms and blizzards and days of drizzle or snow. Methods for analyzing storm tracks over the prairies are necessarily quite different from those for low pressure ridges coming off the oceans. There is also a need to incorporate several forms of explanatory or collateral variables, such as sea pressure levels, temporal variations in the annual oscillation of the high pressure zone covering the Arctic, and elevation. Existing methods are lacking in many important respects, including the capacity to display the uncertainty in an estimated precipitation value for a particular location on a specific day. This project is a major challenge to existing statistical methodology as statisticians have only rudimentary tools available at this time for working with models of this structure.

Marine Ecology:

In marine ecology, complex spatio-temporal data have arisen largely due to new developments in technology and include, for example, animal tracking data, high-frequency time series of biological variables from mooring arrays, and satellite-based measurements of the ocean ecosystem. The data vary both spatially and temporally, and are usually both noisy and under-sampled with respect to the spatio-temporal processes under consideration. Advancements in marine ecology will be realized using these new observations, but not without the development of new statistical analysis techniques and the clever modification of existing techniques to ensure that the data are used to their full potential. For example, satellite telemetry is increasingly being used to model movements of marine animals and birds. In order to understand the observed biological phenomena, it is important that solid statistical methodology be developed to model these typically long distance movements. Low dimensional state space models for animal tracking are under investigation with the aim of understanding the movement dynamics and the interaction with environmental variables.
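As a rough illustration of what a low dimensional state space model involves, the sketch below applies a Kalman filter to a one-dimensional random-walk position observed with noise, skipping the update step when an observation is missing. The model, the parameter names and the handling of gaps are assumptions chosen for simplicity; real tracking models are typically two-dimensional and may include non-Gaussian errors and environmental covariates.

```python
def kalman_filter_local_level(observations, process_var, obs_var,
                              init_mean=0.0, init_var=1e6):
    """Filter a random-walk state observed with Gaussian noise.

    State:        x_t = x_{t-1} + w_t,  w_t ~ N(0, process_var)
    Observation:  y_t = x_t + v_t,      v_t ~ N(0, obs_var)

    A value of None in `observations` marks a missing fix; the state is
    then predicted forward without being updated. Returns a list of
    (filtered mean, filtered variance) pairs, one per time step.
    """
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        # Prediction step: the random walk keeps the mean and adds variance.
        mean_pred = mean
        var_pred = var + process_var
        if y is None:
            mean, var = mean_pred, var_pred
        else:
            # Update step: weight the new observation by the Kalman gain.
            gain = var_pred / (var_pred + obs_var)
            mean = mean_pred + gain * (y - mean_pred)
            var = (1.0 - gain) * var_pred
        filtered.append((mean, var))
    return filtered
```

A call such as `kalman_filter_local_level([1.2, None, 1.9, 2.4], process_var=0.1, obs_var=1.0)` returns the filtered position and its uncertainty at each time step; the variance grows across the missing fix, which is how the filter represents under-sampling.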

Forest Ecology:

Canada’s forests form ten percent of the total forested area of the world and play a vital role in our economic, environmental and physical well-being. They are a rich and diverse resource that supports a multi-billion dollar economy and is the single largest contributor to Canada’s balance of trade (32 billion dollars in 2004). Research on fire modeling, management and ecological effects has improved fire management and our understanding of forest ecology. However, with the recent rapid growth of spatio-temporal statistical methodology, intensive computing technology, the possibility of seamless integration of statistical methods in GIS and spatial visualization tools, and rapid information transfer on forest fires, including new developments of historical databases, there is a strong immediate need to merge expertise across traditional forestry research cultures. The principal vision of this project is to provide a collaborative, statistically and computationally sophisticated environment, with a backbone of forestry researchers and agencies, that will support the development of spatial data analytic and visualization tools for fire and forest ecology and forest management.