Database Credentialed Access
High-Resolution Digital Pathology Imaging of Breast Cancer
William Tran , Fang-I Lu , Katarzyna Jerzak
Published: Oct. 2, 2025. Version: 1.0.0
'High-Resolution Digital Pathology Imaging of Breast Cancer' has been released! (Oct. 2, 2025, 3:11 p.m.) This breast cancer tumor biopsy dataset, prepared by Dr. William Tran as part of the Health Data Nexus Dataset Grants, is now available on the platform! Take a look at this fascinating dataset!
When using this resource, please cite:
Tran, W., Lu, F., & Jerzak, K. (2025). High-Resolution Digital Pathology Imaging of Breast Cancer (version 1.0.0). Health Data Nexus. https://doi.org/10.57764/pm3v-b131.
Abstract
This single-institution study assembled a breast tumor database comprising clinical data and digital pathology imaging of breast tumor biopsies. All patients were treated with neoadjuvant (pre-operative) systemic therapy (NAT).
Background
Breast cancer is the most commonly diagnosed cancer in women globally and remains a leading cause of cancer-related mortality. Many patients with high-risk or locally advanced breast cancer are treated with neoadjuvant systemic therapy (NAT) prior to surgery. A clinical indicator of successful NAT is a pathological complete response (pCR), defined as the complete clearance of invasive and in situ disease in the breast and axillary nodes. Patients who achieve a pCR have a significantly lower risk of breast cancer recurrence and demonstrate better survival outcomes than patients who exhibit residual disease after NAT [1,2]. This Health Data Nexus Breast Project comprises clinical and digital imaging data of women treated with NAT (n=157).
Methods
1. Health Data Nexus Digital Breast Tumor Pathology Dataset: The dataset comprises clinical and digital pathology samples collected at a single Canadian institution. The dataset contains information on breast cancer patients (n=157) who received NAT between 2013 and 2018. This dataset was designed to support AI research in digital pathology by providing high-resolution whole-slide images (WSIs) of pre-treatment tumor core biopsies.
2. Target Sample Collection and Data Provenance: The dataset contains pre-treatment digitized core biopsies of breast tumors. High-resolution images were acquired to enable tiling.
3. Clinical Attributes and Patient Population: Patient records were reviewed. Individuals were included in this study cohort if they had biopsy-confirmed, unifocal, unilateral, non-metastatic breast cancer and had completed anthracycline- and taxane-backbone NAT. Patients were excluded for the following reasons:
- progressive disease requiring salvage therapies (e.g. neoadjuvant chemoradiation)
- incomplete clinicopathologic data reporting (i.e. diagnostic workup performed at an outside institution)
- non-standard and trial NAT
- treatment non-compliance, or incomplete course of NAT.
4. Tumor Collection and Histological Preparation: Pre-treatment breast tumor core biopsies were extracted according to the institution’s standard of care and following the College of American Pathologists (CAP) guidelines. Breast lesions were radiologically confirmed (mammogram, ultrasound or magnetic resonance imaging). Needle-core biopsies of the lesion were obtained using the TRU-CUT system with a 14-gauge needle. The extracted tissue samples measured 0.1 cm in diameter and 1.0 cm to 2.0 cm in length, according to the size of the lesion. Tissue specimens were fixed in formalin, embedded in paraffin, then cut into 4 µm (micron) sections on a microtome. The sections were mounted on glass slides, then stained with hematoxylin and eosin (H&E).
5. Digital Pathology Samples: Tumor slides were imaged using a dedicated pathology imaging system from Huron Digital Pathology (St. Jacobs, Canada). Slides were scanned at 40x magnification; each whole-slide image (WSI) of the tumor core biopsy had a resolution of ~60,000 x 18,000 pixels (pixel size = 0.2 μm, i.e., subcellular resolution). All WSIs were color-calibrated using built-in system settings to correct for any light variances from the optical system.
6. Patient De-identification Process and Quality Check: All samples were stripped of patient labels (name, postal code, date of birth, hospital file number) through manual and computational methods. First, all slides with patient-related information affixed to the slide label were manually blinded before imaging. Second, the acquired images were reconstructed to isolate the tumor core images alone (discarding the slide background), and all metadata fields containing patient-related identifiers were removed. A non-proprietary Tagged Image File Format (.tif) file was constructed and named according to the anonymized study accession number. The digital images were manually inspected by research staff to ensure quality, which included removing blurriness, artifacts and physical deformities.
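The image geometry quoted in the Digital Pathology Samples item can be turned into physical dimensions and a tile grid with a little arithmetic. The following is a minimal sketch, not the authors' pipeline: the pixel dimensions and pixel size come from the text (and are approximate), while the 512-pixel tile size and the function names are assumptions chosen for illustration.

```python
import math

def physical_size_mm(width_px, height_px, um_per_px):
    """Convert pixel dimensions to millimetres using the pixel size."""
    return (width_px * um_per_px / 1000.0, height_px * um_per_px / 1000.0)

def tile_origins(width_px, height_px, tile_px):
    """Yield (x, y) top-left corners of a non-overlapping tile grid.

    Tiles on the right and bottom edges may extend past the image border,
    so a reader would need to pad or crop them.
    """
    for y in range(0, height_px, tile_px):
        for x in range(0, width_px, tile_px):
            yield (x, y)

# Values taken from the dataset description (approximate per slide).
WIDTH_PX, HEIGHT_PX = 60_000, 18_000
UM_PER_PX = 0.2

# Assumption: a 512 x 512 pixel tile, a common choice for patch-based models.
TILE_PX = 512

w_mm, h_mm = physical_size_mm(WIDTH_PX, HEIGHT_PX, UM_PER_PX)
n_cols = math.ceil(WIDTH_PX / TILE_PX)
n_rows = math.ceil(HEIGHT_PX / TILE_PX)
origins = list(tile_origins(WIDTH_PX, HEIGHT_PX, TILE_PX))

print(f"Scanned area: about {w_mm:.1f} mm x {h_mm:.1f} mm")
print(f"Tile grid: {n_cols} columns x {n_rows} rows = {len(origins)} tiles")
print(f"Each tile spans {TILE_PX * UM_PER_PX:.1f} um per side")
```

Under these assumptions a nominal slide covers roughly 12 mm x 3.6 mm of tissue. The tile origins could then be passed to a whole-slide reader, for example the `read_region` method of OpenSlide's Python bindings, to extract pixel data one tile at a time.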
Data Description
The dataset is structured into 157 individual folders labeled “NAC-1” through “NAC-157,” each representing a unique patient case. Each folder contains two files:
- a whole-slide image file in Tagged Image File Format (.tif), comprising a digitized H&E-stained core biopsy (may contain multiple cores per patient) scanned at 40x magnification (0.2 μm/pixel), and
- a Microsoft Word document (.docx) that includes a structured table summarizing clinical data along with an embedded representative thumbnail of the corresponding biopsy image.
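The folder layout described above lends itself to a quick completeness check after download. The sketch below is illustrative only: the root path is hypothetical, the file names inside each folder are not specified in this description, so files are matched by extension alone, and `index_cases` and `incomplete_cases` are names made up for this example.

```python
from pathlib import Path

def index_cases(root, n_cases=157):
    """Map each case ID ("NAC-1" ... "NAC-<n_cases>") to its two files.

    Files are matched by extension because their exact names are not
    documented here. A missing file is recorded as None.
    """
    index = {}
    for i in range(1, n_cases + 1):
        case_id = f"NAC-{i}"
        folder = Path(root) / case_id
        tifs = sorted(folder.glob("*.tif")) if folder.is_dir() else []
        docs = sorted(folder.glob("*.docx")) if folder.is_dir() else []
        index[case_id] = {
            "wsi": tifs[0] if tifs else None,
            "clinical": docs[0] if docs else None,
        }
    return index

def incomplete_cases(index):
    """Return case IDs that lack the image, the clinical document, or both."""
    return [
        case_id
        for case_id, files in index.items()
        if files["wsi"] is None or files["clinical"] is None
    ]

if __name__ == "__main__":
    # Hypothetical location of the downloaded dataset.
    dataset_root = "path/to/dataset"
    cases = index_cases(dataset_root)
    missing = incomplete_cases(cases)
    print(f"{len(cases) - len(missing)} of {len(cases)} cases are complete")
```

Matching by extension keeps the check independent of the accession-number naming scheme; if a folder ever held more than one .tif, this sketch would silently take the first in sorted order, which may not be what an analysis needs.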
Usage Notes
This dataset was developed to support the advancement of artificial intelligence (AI) applications in digital pathology, with a focus on predicting response to neoadjuvant chemotherapy (NAC) in high-risk breast cancer. The .tif whole-slide images can be opened with digital pathology software such as Sedeen Viewer, OpenSlide, and QuPath. Files are stored in standard formats (.tif, .docx, .xlsx) to allow integration into a variety of research applications.
Ethics
Co-Operation with Institutional REB: This study was approved by the institutional ethics review board (REB# 270-2018).
Conflicts of Interest
The author(s) have no conflicts of interest to declare.
References
- P. Cortazar et al. (2014), “Pathological complete response and long-term clinical benefit in breast cancer: the CTNeoBC pooled analysis,” The Lancet, vol. 384, no. 9938, pp. 164–172, doi: 10.1016/S0140-6736(13)62422-8.
- G. von Minckwitz et al. (2012), “Definition and impact of pathologic complete response on prognosis after neoadjuvant chemotherapy in various intrinsic breast cancer subtypes,” Journal of Clinical Oncology, vol. 30, no. 15, pp. 1796–1804, doi: 10.1200/JCO.2011.38.8595.
- W. F. Symmans et al. (2007), “Measurement of residual breast cancer burden to predict survival after neoadjuvant chemotherapy,” Journal of Clinical Oncology, vol. 25, no. 28, pp. 4414–4422, doi: 10.1200/JCO.2007.10.6823.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
Health Data Nexus Contributor Review Health Data License 1.0
Data Use Agreement:
T-CAIREM Data Use Agreement
Required training:
TCPS 2: CORE 2022
Discovery
DOI (version 1.0.0):
https://doi.org/10.57764/pm3v-b131
DOI (latest version):
https://doi.org/10.57764/hprg-9p03
Files
To access the files for this project, you must:
- be a credentialed user
- complete the required training: TCPS 2: CORE 2022
- sign the data use agreement for the project