Database Credentialed Access
COVID-19 Hospital Demographic, Clinical and Outcome Dataset
Farbod Abolhassani , Alexander Bilbily , Benjamin Fine
Published: Feb. 23, 2023. Version: 1.0.0
When using this resource, please cite:
Abolhassani, F., Bilbily, A., & Fine, B. (2023). COVID-19 Hospital Demographic, Clinical and Outcome Dataset (version 1.0.0). Health Data Nexus. https://doi.org/10.57764/3nza-5j34.
The AI Deployment and Evaluation (AIDE) Lab at the Institute for Better Health (IBH), Trillium Health Partners (THP), took part in a Supercluster Canada-funded multi-centre study that collected a dataset containing key demographics and clinical parameters of hospitalized patients diagnosed with COVID-19. The dataset comprises de-identified health-related data (demographics, clinical parameters such as vital signs and laboratory values, imaging findings, and hospital outcomes) for a cohort of 509 patient visits admitted to THP via the Emergency Department between November 1, 2020, and March 15, 2021, that met criteria for likely COVID-19. The data were sourced from three distinct systems: the THP Enterprise Data Warehouse, the Electronic Health Record (EHR), and the Picture Archiving and Communication System (PACS). This dataset was combined with data from two other Ontario sites and used to develop a machine learning model to predict patient outcomes, specifically death and discharge.
The purpose of this project was twofold. The first aim was to produce a large, feature-rich dataset and make it available to the global community of innovators and researchers, with the goal of encouraging the creation of valuable insights in the fight against COVID-19. The second aim was to use the carefully curated dataset to develop a machine-learning clinical prediction tool to aid in the management of hospitalized COVID-19 patients.
Patient Cohort Definition
Patients admitted to Trillium Health Partners (THP) via the Emergency Department between November 1, 2020, and March 15, 2021, were included if they met the following criteria: a positive COVID-19 test within the first 72 hours of admission; admission to one of the wards designated for COVID-19 patients or to the ICU; and a COVID-19-like admitting diagnosis. Patients under 18 years of age, patients who were transferred to a non-THP acute care setting, and patients discharged after March 30 were excluded.
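The inclusion and exclusion rules above can be read as a single predicate over a visit. The following is a minimal sketch under stated assumptions, not the authors' extraction logic: the `Visit` record and all of its field names are hypothetical, the 72-hour rule is interpreted literally as "0 to 72 hours after admission", and the discharge cutoff is a required argument because the source gives no year for March 30.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

# Admission window stated in the cohort definition.
WINDOW_START = date(2020, 11, 1)
WINDOW_END = date(2021, 3, 15)


@dataclass
class Visit:
    """Hypothetical visit record; every field name here is illustrative."""
    age_years: int
    admitted_at: datetime
    first_positive_test_at: Optional[datetime]
    on_covid_ward_or_icu: bool
    covid_like_admitting_diagnosis: bool
    transferred_to_non_thp_acute_care: bool
    discharged_on: date


def is_eligible(v: Visit, discharge_cutoff: date) -> bool:
    """Apply the stated inclusion and exclusion criteria to one visit."""
    # Inclusion: admitted within the study window.
    if not (WINDOW_START <= v.admitted_at.date() <= WINDOW_END):
        return False
    # Inclusion: positive test within the first 72 hours of admission
    # (assumed reading: between 0 and 72 hours after admission).
    if v.first_positive_test_at is None:
        return False
    delay = v.first_positive_test_at - v.admitted_at
    if not (timedelta(0) <= delay <= timedelta(hours=72)):
        return False
    # Inclusion: COVID-19 ward or ICU, and a COVID-19-like admitting diagnosis.
    if not (v.on_covid_ward_or_icu and v.covid_like_admitting_diagnosis):
        return False
    # Exclusions: minors, transfers out of THP, and late discharges.
    if v.age_years < 18 or v.transferred_to_non_thp_acute_care:
        return False
    if v.discharged_on > discharge_cutoff:
        return False
    return True
```

The released files already reflect these criteria, so a predicate like this is mainly useful for reasoning about who is and is not represented in the cohort.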
The common data model was defined centrally for all three sites. Observational data collected routinely during medical care at THP were mapped to the data model through expert consultation (a Clinical Informatics Analyst, an EHR consultant, and internal Business Intelligence advisors). A method for extraction and transformation was identified for each data element from one of three source systems (the Enterprise Data Warehouse, the Electronic Health Record, or PACS). Both programmatic extraction and manual chart abstraction were required.
Data elements (n=74) were extracted using two different methods:
- 58 data elements were programmatically extracted from the electronic health record (EHR) and the picture archiving and communication system (PACS) by a Data Engineer, an EHR consultant, and a Business Intelligence advisor
- 16 data elements were extracted by chart abstraction from the electronic health record by two Research Associates
Three approaches were adopted to identify potential data veracity issues:
- The authors explored summary statistics of the data elements and compared them against clinician-derived appropriate ranges. Any values outside these ranges were considered outliers. Two clinicians reviewed the list of outliers and flagged values for further investigation. A research associate verified the flagged outlier values against the patient chart in Epic Hyperspace. If a value was confirmed to be incorrect, the research associate manually resolved the error.
- The authors selected a 5% random sample of values for each data element for validation. A research associate compared the value in the dataset against the value in the EHR. Any discrepancies were documented in Excel and resolved through the web portal. For data elements collected through chart abstraction, a second Research Associate reviewed the data elements and compared them to the EHR. Any discrepancies were documented in Excel and resolved.
- Following completion of the validation stage, the authors verified an additional 5% of all data elements by comparing the values in the web portal with the values from the patient chart in the EHR to ensure no issues remained.
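The first two checks (range-based outlier flagging and a 5% random sample per data element) can be sketched as below. This is an illustration under assumptions, not the authors' tooling: the field names and the numeric ranges in `PLAUSIBLE_RANGES` are invented placeholders (the clinician-derived ranges are not given in this description), and the fixed `seed` is only there to make the sample reproducible.

```python
import math
import random

# Placeholder ranges for illustration only; NOT the clinician-derived ranges.
PLAUSIBLE_RANGES = {
    "heart_rate": (20.0, 250.0),
    "temperature_c": (30.0, 43.0),
}


def flag_outliers(rows, ranges):
    """Return (row_index, field, value) for every value outside its range."""
    flagged = []
    for i, row in enumerate(rows):
        for field, (low, high) in ranges.items():
            value = row.get(field)
            # Missing values are skipped rather than flagged.
            if value is not None and not (low <= value <= high):
                flagged.append((i, field, value))
    return flagged


def sample_for_review(rows, field, fraction=0.05, seed=0):
    """Pick a random fraction of the non-missing values of one field.

    Returns the sorted row indices to be checked manually against the chart.
    """
    candidates = [i for i, row in enumerate(rows) if row.get(field) is not None]
    if not candidates:
        return []
    k = max(1, math.ceil(fraction * len(candidates)))
    return sorted(random.Random(seed).sample(candidates, k))
```

Flagged values and sampled indices would then go to a human reviewer; as in the process described above, the code only nominates values for checking and does not correct them.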
Admission data
- One row represents one unique visit in both the CSV and ndjson files
- Data includes but is not limited to:
  - A unique patient identifier used to link to the other tables.
  - Patient demographics: age and sex. The "age" field is grouped into 5-year categories between 20 and 95; all patients over the age of 89 are placed in a single category with the value 90. The "sex" field contains the entries Male and Female.
  - Patient comorbidities (extracted from clinical notes)
  - Patient medications
  - Labs and vital signs upon admission.
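Because the same visits are distributed as both CSV and ndjson (newline-delimited JSON, one object per line), a small loader for each format plus a uniqueness check can confirm the one-row-per-visit layout. This is a sketch only: the identifier column name `id` and the sample rows are made up for illustration, and the real column names should be taken from the data dictionaries.

```python
import csv
import io
import json


def read_ndjson(stream):
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def read_csv(stream):
    """Parse CSV with a header row into a list of dicts (values stay strings)."""
    return list(csv.DictReader(stream))


def assert_one_row_per_visit(rows, id_field):
    """Raise ValueError if any identifier appears on more than one row."""
    seen = set()
    for row in rows:
        key = row[id_field]
        if key in seen:
            raise ValueError(f"duplicate visit identifier: {key!r}")
        seen.add(key)


# Tiny in-memory stand-ins for the real files (contents are invented).
ndjson_text = (
    '{"id": "v1", "age": 65, "sex": "Female"}\n'
    '{"id": "v2", "age": 90, "sex": "Male"}\n'
)
csv_text = "id,age,sex\nv1,65,Female\nv2,90,Male\n"

ndjson_rows = read_ndjson(io.StringIO(ndjson_text))
csv_rows = read_csv(io.StringIO(csv_text))
assert_one_row_per_visit(ndjson_rows, "id")
assert_one_row_per_visit(csv_rows, "id")
```

Note that `csv.DictReader` returns every value as a string, whereas JSON preserves numeric types, so the two formats need different handling before numeric comparison.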
Daily data
- One row represents one day of a patient's stay at the hospital.
- Data includes but is not limited to:
  - A unique identifier for each patient-day
  - Patient ID (named parent_id)
  - Daily labs
  - Imaging studies ordered and corresponding findings
- Data is present for days 1 through 7, and day 14
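Since daily rows exist only for days 1 through 7 and day 14, it can help to group them by `parent_id` and see which of those days are present for each patient. The sketch below assumes a hypothetical day-number column called `day`; only `parent_id` is named in the description above. A missing day is not necessarily an error, since a patient with a short stay would be expected to have fewer rows.

```python
from collections import defaultdict

# Days for which the daily table carries data, per the description above.
EXPECTED_DAYS = [1, 2, 3, 4, 5, 6, 7, 14]


def days_by_patient(daily_rows, day_field="day"):
    """Group the recorded day numbers by parent_id."""
    grouped = defaultdict(set)
    for row in daily_rows:
        # int() also accepts the string values produced by a CSV reader.
        grouped[row["parent_id"]].add(int(row[day_field]))
    return grouped


def missing_days(daily_rows, day_field="day"):
    """Map each parent_id to the expected days that have no row."""
    return {
        pid: [d for d in EXPECTED_DAYS if d not in days]
        for pid, days in days_by_patient(daily_rows, day_field).items()
    }
```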
Outcome data
- One row represents one unique visit
- Data includes but is not limited to:
  - Patient ID
  - Hospital length of stay
  - Number of days on mechanical ventilation
  - Visit outcome (including death)
  - Cause of death (if applicable)
Data dictionaries are provided for the admission, daily, and outcome data. They give detailed information for each field in the corresponding data file, including units and useful notes.
The dataset comprises multiple interrelated files and is best analyzed using software capable of merging individual data files.
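As an illustration of what merging the files involves, the sketch below joins the three tables into one nested record per visit using only the Python standard library. The identifier column `id` for the admission and outcome tables is an assumption (the description gives no column name for it), while `parent_id` is the patient ID column named for the daily data; a dataframe library with a merge or join operation would do the same job.

```python
def merge_visit_tables(admission_rows, daily_rows, outcome_rows, id_field="id"):
    """Return one nested record per visit, keyed by the admission identifier.

    Each record holds the admission row, the matching outcome row (or None),
    and the list of daily rows whose parent_id matches.
    """
    merged = {
        row[id_field]: {"admission": row, "outcome": None, "daily": []}
        for row in admission_rows
    }
    for row in outcome_rows:
        if row[id_field] in merged:
            merged[row[id_field]]["outcome"] = row
    for row in daily_rows:
        # Daily rows with no matching admission record are ignored.
        if row["parent_id"] in merged:
            merged[row["parent_id"]]["daily"].append(row)
    return merged
```

Keeping the daily rows as a list per visit preserves the one-to-many relationship between a visit and its patient-days, instead of duplicating admission fields across every daily row.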
The study was approved by the Trillium Health Partners Research Ethics Board (REB ID#1044) with administrative review by the University of Toronto REB.
The research project for which this dataset was created was a collaboration with 16Bit Inc. and funded by the Digital Technology Supercluster. The authors thank Morgan Lim, Jonathan Ranisau, and Mark Cicero for their contributions.
Conflicts of Interest
Alex Bilbily and Mark Cicero are shareholders of 16Bit Inc. There are no other conflicts of interest to declare.
Only credentialed users who sign the DUA can access the files.
License (for files):
Health Data Nexus Contributor Review Health Data License 1.0
Data Use Agreement:
T-CAIREM - Trillium Health Partners Data Use Agreement
Required training:
TCPS 2: CORE 2022
To access the files, you must:
- be a credentialed user
- complete the required training: TCPS 2: CORE 2022
- sign the data use agreement for the project