Datasets

GMCI Dataset Collection

The table below provides an overview of all datasets currently held in the GMCI Zenodo community. Each entry links directly to the corresponding Zenodo record, where full documentation, provenance information, associated files, and a citable DOI are available. The table may be filtered interactively by selecting the desired attributes in the column headers.

Researchers who have produced datasets suitable for graphical modelling or causal inference research are encouraged to submit them to the collection. Submission instructions and the required metadata schema are described on the Contributing page.

External Data Sources for Custom Extraction

Many research problems in graphical modelling and causal inference require data that must be extracted or derived from existing repositories, rather than obtained as a ready-made tabular file. The following sections provide a structured overview of well-established data sources organised by application domain, together with illustrative examples of the causal questions they are suited to address. Researchers who derive new datasets through such extraction or post-processing are encouraged to document and submit their finalised materials to the GMCI Zenodo community, in accordance with the applicable licences of the source data.
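As an illustration of the kind of post-processing meant here, the following minimal sketch pivots a long-format extract (one row per unit, year, and indicator) into a wide table with one row per unit and year, the layout most structure-learning and causal inference tools expect. The column names, indicator names, and all values are invented for this example and do not come from any of the sources listed below. The function `long_to_wide` is a hypothetical helper, not part of any GMCI tooling.

```python
import csv
import io
from collections import defaultdict

# Toy long-format extract. Every name and value here is invented
# for illustration and does not come from a real data source.
LONG_CSV = """country,year,indicator,value
A,2000,gdp_growth,2.1
A,2000,inflation,3.0
A,2001,gdp_growth,1.8
B,2000,gdp_growth,4.0
B,2000,inflation,5.5
"""


def long_to_wide(text):
    """Pivot (country, year, indicator, value) rows into one row per (country, year)."""
    rows = defaultdict(dict)
    indicators = []  # preserves first-seen order of indicator names
    for rec in csv.DictReader(io.StringIO(text)):
        key = (rec["country"], int(rec["year"]))
        rows[key][rec["indicator"]] = float(rec["value"])
        if rec["indicator"] not in indicators:
            indicators.append(rec["indicator"])

    wide = []
    for country, year in sorted(rows):
        row = {"country": country, "year": year}
        for name in indicators:
            # None marks an indicator that was not reported for this unit and year
            row[name] = rows[(country, year)].get(name)
        wide.append(row)
    return wide


wide_table = long_to_wide(LONG_CSV)
for row in wide_table:
    print(row)
```

Missing values are kept explicit as `None` rather than dropped, so that the handling of incomplete records remains a documented, deliberate step when the derived dataset is submitted.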


Health Data

Health data sources provide rich opportunities for studying causal relationships among clinical variables, genetic factors, and patient outcomes. Questions of interest include, for example, whether early administration of vasopressors improves survival rates in ICU patients with septic shock, how genetic predisposition to obesity mediates cardiovascular risk, and what effect sugar consumption has on diabetes incidence across population strata.

Source | License | Data Provided
--- | --- | ---
MIMIC-IV (see also PhysioNet) | PhysioNet Credentialed Health Data License 1.5.0 (training required; License) | Critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center
National Health and Nutrition Examination Survey | Statistical reporting and analysis only (License) | Health of about 5,000 adults and children in the US surveyed each year: food, drinks, supplements, nutrients, blood tests
UK Biobank | Application and ethics approval required (License) | Genetic and health information from half a million UK participants: imaging, genetics, health linkages, biomarkers, activity monitor, questionnaires, blood samples
Cardiology: MIT-BIH Arrhythmia Database (see also PhysioNet) | ODC-By (License) | 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects
Neurology: OpenNeuro | CC0 (License) | MRI, PET, EEG, iEEG, and MEG datasets
Epidemiology: Johns Hopkins Coronavirus Resource Center | CC BY 4.0 (License) | Global COVID-19 case counts, deaths, and recoveries at the country and state/province level, updated throughout the pandemic

Economic time series are frequently used to study causal relationships among macroeconomic variables, including the effect of monetary policy on inflation, the relationship between government expenditure and labour market outcomes, and the transmission mechanisms of financial shocks.

Source | License | Data Provided
--- | --- | ---
World Bank Open Data | CC BY 4.0 (License) | Time series data, performance indicators
OECD Data | CC BY 4.0 (License) | Consumer price indices, institutional investors' indicators
Federal Reserve Economic Data (FRED) | Non-commercial (License) | Economic data series on banking, business, consumer and producer price indices, employment, population, exchange rates, GDP, interest rates, trade, and U.S. financial data

Survey-based social data are used to study causal and associational relationships in areas such as the effect of working conditions on family stability, the influence of political climate on institutional trust, and the role of socioeconomic background in educational attainment.

Source | License | Data Provided
--- | --- | ---
General Social Survey (GSS) | Purchase required (License) | Health, marriage, family, work conditions
Pew Research Data | Nonexclusive, non-sublicensable, non-transferable, revocable, worldwide, royalty-free (License) | Politics, immigration, religion, technology
European Social Survey (ESS) | CC BY-NC-SA 4.0 (License) | Public attitudes, beliefs, and behaviour

Environmental datasets support causal analyses of atmospheric dynamics, land-use change, biodiversity, and climate variability. Representative questions include the effect of atmospheric composition on surface temperature, the causal role of deforestation in regional precipitation patterns, and the relationship between extreme weather events and biodiversity loss.

Source | License | Data Provided
--- | --- | ---
NASA Earth Data | CC0 (License) | Climate indicators, human dimensions, atmosphere, biosphere
Copernicus Climate Data | Public Domain (License) | Satellite earth observation and in-situ (non-space) data
Global Biodiversity Information Facility (GBIF) | CC0, CC BY, or CC BY-NC (License) | Species occurrence, taxonomic information, habitat data

Political science datasets make it possible to study the causal determinants of policy outcomes, electoral behaviour, and institutional performance. Examples include the effect of government duration on policy impact, the relationship between corruption scandals and public trust, and the role of socioeconomic inequality in election results.

Source | License | Data Provided
--- | --- | ---
Comparative Political Data Set (CPDS) | Information is requested | Annual political and institutional data, party composition, reshuffles, duration, reason for termination, and type of government for 36 countries
World Justice Project Rule of Law Index | CC BY-ND 4.0 (License) | Perception of the rule of law among the general public in 136 countries: Constraints on Government Powers, Absence of Corruption, Open Government, Fundamental Rights, Order and Security, Regulatory Enforcement, Civil Justice, and Criminal Justice

A further directory of US datasets: Congressional Voting Records

Energy datasets are suited to causal analyses of the relationship between policy instruments and energy transition outcomes, including the effect of subsidy programmes on renewable energy capacity, the causal link between fossil fuel prices and industrial investment, and the role of hydrogen expansion in reducing carbon emissions.

Source | License | Data Provided
--- | --- | ---
International Energy Agency (IEA) | CC BY, Non-Commercial (Individual licenses) | Electricity access and demand, energy and emissions, hydrogen production, gas statistics, oil prices
Our World in Data – Energy | CC BY (License) | International energy data, fossil fuels, electricity data, energy mix, energy consumption
U.S. Energy Information Administration (EIA) | CC0, CC BY, custom (Individual licenses) | Power plants, biofuels, natural gas storage, HGL pipelines, geothermal potential

Additional domains that have generated datasets well-suited to graphical modelling and causal inference research include gene expression data, meteorological records, psychological experimental data, population registers, pharmaceutical trial data, sensor-based time series, transportation and mobility data (see, for example, the U.S. Department of Transportation Open Data and NYC Open Data), education data (see NCES and OECD Education at a Glance), and crime data (see FBI Crime Data Explorer and UK Police Data).