Database Credentialed Access

GIM, a dataset for predicting patient deterioration in the General Internal Medicine ward

Sebnem Kuzulugil Chloe Pou-Prom Muhammad Mamdani Joshua Murray Amol Verma Kaiyin Zhu Michaelia Banning

Published: Oct. 21, 2022. Version: 1.0.0

When using this resource, please cite:
Kuzulugil, S., Pou-Prom, C., Mamdani, M., Murray, J., Verma, A., Zhu, K., & Banning, M. (2022). GIM, a dataset for predicting patient deterioration in the General Internal Medicine ward (version 1.0.0). Health Data Nexus.

Abstract


The Data Science and Advanced Analytics (DSAA) team at Unity Health Toronto has developed and evaluated advanced patient monitoring and decision support systems to improve the efficiency, accuracy, and timeliness of clinical decision-making on the General Internal Medicine (GIM) inpatient ward at St. Michael’s Hospital. The GIM dataset was created through this work and comprises de-identified health-related data associated with over 22,000 patient encounters for 14,000 unique patients who were admitted under the GIM service at St. Michael’s Hospital between 2011 and 2019. The dataset was sourced from three distinct systems: the Electronic Health Records, the Admit Discharge Transfer System, and the Medication Administration Check System. Pre-processed datasets that aggregate observations into fixed time windows are provided for convenience. A raw, untransformed dataset is also provided for researchers who wish to apply their own data transformations; it includes the demographics and outcome tables from the processed data. Available patient outcomes include ICU transfer, death, palliative entry, palliative discharge, and hospital discharge.

Background


Modern General Internal Medicine (GIM) departments employ a wide array of sophisticated instrumentation to provide detailed assessment of the pathophysiological state of each patient. Ideally, such monitoring permits the early detection of changes in the patient's condition and provides information that both supports therapeutic decision-making and assists in evaluating the response to treatment. However, providing patient care is becoming increasingly complex because of the growing volume of relevant data from clinical observations, bedside monitors, and a wide variety of lab tests. Furthermore, the sheer volume of electronic health record data and its poor organization make integration and interpretation time-consuming and inefficient, and have created “information overload”, which may lead to errors and mishaps in patient care. Retrospective research using data gathered from GIM patients could drive improvements in their care, which would ultimately benefit society at large.

Methods


Data was extracted from the following source systems:

  • Admit-Discharge-Transfer (ADT) System: Used to identify patient encounters under the GIM service.
  • Electronic Medical Records (EMR): Demographics, laboratory results, clinical orders, vitals and ICD-10 codes.
  • Medication Administration Check (MAK): Documentation for all inpatient medication administrations, including the type of medication, dose, timing, administration route, and administration timestamp.

All data was de-identified prior to publication. The following de-identification steps were taken:

  • Patient identifiers, addresses, postal codes, and names were removed from the data.
  • Each patient encounter was assigned a random 6-digit integer identifier.
  • Any variable containing the year or month was removed from the data, and only elapsed time relative to a patient time zero was retained.
  • Patient ages were rounded to the nearest 5 years.
  • Ages below 20 were set to 20, and ages above 90 were set to 95.
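
The two age rules above can be illustrated with a minimal sketch. The function name is hypothetical, and the order of operations (capping the raw age first, then rounding everything else half-up to the nearest 5) is an assumption, since the text does not say which rule was applied first.

```python
import math


def deidentify_age(age_years: float) -> int:
    """Illustrative age coarsening; not the authors' actual code."""
    # Assumption: the caps are applied to the raw (unrounded) age.
    if age_years < 20:
        return 20
    if age_years > 90:
        return 95
    # Round half-up to the nearest multiple of 5 (avoids Python's
    # banker's rounding in the built-in round()).
    return int(math.floor(age_years / 5 + 0.5) * 5)
```

Under these assumptions, an age of 43 maps to 45, an age of 17 maps to 20, and an age of 93 maps to 95.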

The dataset is provided in its original form as well as in a pre-processed form that aggregates data into fixed time windows. Time-varying data was binned into 8-hour windows starting from the time the patient entered the GIM ward and ending at the time of discharge. Numeric data was averaged within each 8-hour window. Auxiliary variables were added for: (1) whether a value was measured during the window, and (2) the time since the last measured value. Numeric data was then trimmed and normalized to the range 0 to 1. For orders, a Boolean value was added to indicate whether the order was active during that window. Medications were grouped into classes, and an indicator variable marked administration within the window. Finally, missing numeric data was imputed using last observation carried forward (LOCF). For other data, such as medication administration, a value of zero was imputed.
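
The numeric part of this pipeline can be sketched roughly as follows. This is an illustration only, not the code used to build the dataset: the function name, the input layout (pairs of hours since GIM entry and value), the output keys, and the trimming bounds `lower` and `upper` are all assumptions, and the time since the last measurement is approximated in whole windows.

```python
from typing import Dict, List, Optional, Sequence, Tuple

WINDOW_HOURS = 8  # window length stated in the text


def bin_numeric(
    observations: Sequence[Tuple[float, float]],  # (hours since GIM entry, value)
    n_windows: int,
    lower: float,  # hypothetical trimming bounds
    upper: float,
) -> List[Dict[str, Optional[float]]]:
    # 1. Collect the values that fall into each 8-hour window.
    buckets: List[List[float]] = [[] for _ in range(n_windows)]
    for hours, value in observations:
        idx = int(hours // WINDOW_HOURS)
        if 0 <= idx < n_windows:
            buckets[idx].append(value)

    rows: List[Dict[str, Optional[float]]] = []
    last_value: Optional[float] = None   # carried forward (LOCF)
    last_measured: Optional[int] = None  # index of last measured window
    for idx, values in enumerate(buckets):
        measured = bool(values)
        if measured:
            # 2. Average, trim to [lower, upper], and scale to [0, 1].
            mean = sum(values) / len(values)
            clipped = min(max(mean, lower), upper)
            last_value = (clipped - lower) / (upper - lower)
            last_measured = idx
        # 3. Auxiliary variables; windows before the first measurement
        #    stay None because the text does not say how they are filled.
        hours_since = (
            None if last_measured is None
            else (idx - last_measured) * WINDOW_HOURS
        )
        rows.append({
            "window": idx,
            "value": last_value,
            "measured": int(measured),
            "hours_since_measured": hours_since,
        })
    return rows
```

For example, `bin_numeric([(1, 60), (3, 80), (20, 100)], n_windows=4, lower=40, upper=140)` averages the first two readings into window 0 and carries that value into the empty window 1, with `measured` set to 0 and `hours_since_measured` set to 8 for that window.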

Data Description

The General Internal Medicine (GIM) dataset comprises de-identified health-related data associated with over 22,000 patient encounters for 14,000 unique patients who were admitted under the GIM service at St. Michael’s Hospital between 2011 and 2019. All patients admitted under a GIM service with an admission lasting at least 30 hours were included.

The most frequently occurring and most important variables were selected in consultation with a GIM staff physician. The dataset includes vital signs, laboratory measures, shift assessment variables, fluid intake values, fluid output values, and disease-specific values. Clinical orders include requests for imaging, telemetry, consults, cardiology, dietary measures, respiration, activities, codes, protocols, transfusions, wound care, and neurological care. Medications are grouped into American Hospital Formulary Service (AHFS) classes. Available demographics include age, sex, marital status, language, and religion.

A detailed description of the dataset can be found in the GIM Documentation Dashboard. The dashboard includes a reference guide to all data tables (with a data dictionary and the schema for each table), an explanation of each data table (with plots summarizing key information), tutorials that introduce accessing and using the data, and how-to guides for performing specific analyses with the data.

Usage Notes

The dataset consists of relational tables and is best analyzed using software capable of merging individual data files. The Tutorial page of the GIM Documentation provides literate programming notebooks that demonstrate loading and analyzing the dataset in Python.
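
As a rough illustration of merging relational tables in Python, the sketch below joins small in-memory tables with pandas. The table layouts and column names (`encounter_id`, `window`, `heart_rate`, `outcome`) are hypothetical stand-ins; the actual table and column names are given in the data dictionary in the GIM Documentation.

```python
import pandas as pd

# Hypothetical stand-ins for a demographics table, an outcomes table, and
# a table of windowed measurements; real names come from the data dictionary.
demographics = pd.DataFrame(
    {"encounter_id": [100001, 100002], "age": [65, 80], "sex": ["F", "M"]}
)
outcomes = pd.DataFrame(
    {"encounter_id": [100001, 100002], "outcome": ["discharge", "icu_transfer"]}
)
windows = pd.DataFrame(
    {
        "encounter_id": [100001, 100001, 100002],
        "window": [0, 1, 0],
        "heart_rate": [0.42, 0.45, 0.61],
    }
)

# One row per encounter on both sides, so validate the join as one-to-one.
encounters = demographics.merge(
    outcomes, on="encounter_id", how="left", validate="one_to_one"
)

# Many window rows per encounter: attach encounter-level columns to each.
analysis = windows.merge(
    encounters, on="encounter_id", how="left", validate="many_to_one"
)
```

The `validate` argument makes pandas raise an error if the key is unexpectedly duplicated, which is a cheap guard against silently multiplying rows during a join.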

Release Notes

GIM v1.0 is the initial release of the database.

Ethics


This project has been approved by Unity Health REB, protocol #21-206.

Acknowledgements


The authors would like to thank the St. Michael's Foundation Angels Den Innovation Award (2018) for its support of this project.

Conflicts of Interest

The authors Chloe Pou-Prom, Amol Verma, Muhammad Mamdani, Joshua Murray and Sebnem Kuzulugil hold non-controlling shares in Signal1 AI.




Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
Health Data Nexus Contributor Review Health Data License 1.0

Data Use Agreement:
T-CAIREM Data Use Agreement

Required training:
TCPS 2: CORE 2022
Health Data Nexus Data User Code of Conduct Training
ISED Cybersecurity Training for Researchers
