High-Resolution Digital Pathology Imaging of Breast Cancer

William Tran, Fang-I Lu, Katarzyna Jerzak

Published: Oct. 2, 2025. Version: 1.0.0


When using this resource, please cite:
Tran, W., Lu, F., & Jerzak, K. (2025). High-Resolution Digital Pathology Imaging of Breast Cancer (version 1.0.0). Health Data Nexus. https://doi.org/10.57764/pm3v-b131.

Abstract

This single-institution study assembled a breast tumor database comprising clinical data and digital pathology imaging of breast tumor biopsies. All patients were treated with neoadjuvant (pre-operative) systemic therapy (NAT).

Background

Breast cancer is the most commonly diagnosed cancer in women globally and remains a leading cause of cancer-related mortality. Many patients with high-risk or locally advanced breast cancer are treated with neoadjuvant systemic therapy (NAT) prior to surgery. A clinical indicator of successful NAT is a pathological complete response (pCR), defined as complete clearance of invasive and in situ disease in the breast and axillary nodes. Patients who achieve a pCR have a significantly lower risk of breast cancer recurrence and demonstrate better survival outcomes compared to patients who exhibit residual disease after NAT [1,2]. This Health Data Nexus Breast Project comprises clinical and digital imaging data of women treated with NAT (n=157).

Methods

1. Health Data Nexus Digital Breast Tumor Pathology Dataset: The dataset comprises clinical and digital pathology samples collected at a single Canadian institution. The dataset contains information on breast cancer patients (n=157) who received NAT between 2013 and 2018. This dataset was designed to support AI research in digital pathology by providing high-resolution whole-slide images (WSIs) of pre-treatment tumor core biopsies.

2. Target Sample Collection and Data Provenance: The dataset contains pre-treatment digitized core biopsies of breast tumors. High-resolution images were acquired to enable tiling, that is, subdividing each whole-slide image into smaller patches for analysis.

3. Clinical Attributes and Patient Population: Patient records were reviewed. Individuals were included in this study cohort if they had biopsy-confirmed, unifocal, unilateral, non-metastatic breast cancer and had completed anthracycline- and taxane-backbone NAT. Patients were excluded for the following reasons:

  1. progressive disease requiring salvage therapies (e.g. neoadjuvant chemoradiation)
  2. incomplete clinicopathologic data reporting (i.e. diagnostic workup performed at an outside institution)
  3. non-standard and trial NAT
  4. treatment non-compliance, or incomplete course of NAT.
Clinical information was extracted from the institutional electronic medical records. Data included patient age (greater than 18 years), clinical tumor size (largest radiologically reported dimension; mm), clinical nodal status (TNM; confirmed by fine-needle aspiration), Nottingham grade (G1/G2/G3), presence or absence of inflammatory cancer (defined as breast carcinoma with dermal lymphatic invasion), histological type (ductal versus lobular), NAT type, surgery type (post-NAT), adjuvant radiation dosage, adjuvant systemic therapy type, and survival information. The dataset was curated to represent tumor subtypes.

Estrogen receptor status (ER%), progesterone receptor status (PR%), and human epidermal growth factor receptor-2 (HER2) status (+/-) were also included in the feature set. ER, PR, and HER2 were assessed by immunohistochemistry (IHC), and equivocal HER2 status was confirmed by in situ hybridization (ISH or FISH). The neoadjuvant drug regimen was recorded for analysis: either dose-dense doxorubicin, cyclophosphamide, and paclitaxel (ddAC-T) or fluorouracil, epirubicin, cyclophosphamide, and docetaxel (FEC-D). All patients with HER2+ breast cancers were given trastuzumab anti-HER2 therapy during taxane chemotherapy, and this is recorded in the dataset.

Neoadjuvant treatment response was evaluated using the Residual Cancer Burden Index (RCBI). An RCBI score of 0 (i.e., pCR) was defined as the absence of residual invasive and nodal disease [3]. Patients with residual disease were classified as non-pCR (i.e., RCBI > 0). All pathology reports (pre-treatment histopathology and post-NAT synoptic pathology) were evaluated by board-certified breast pathologists as part of the patient’s standard of care. Similarly, board-certified breast radiologists carried out radiological reporting for diagnostic information.
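
As a minimal sketch of the response labeling described above, the Python function below maps an RCBI score to the binary pCR / non-pCR outcome. The function name and the choice to pass the score as a plain number are illustrative assumptions; the source does not specify how the score is stored in the dataset files.

```python
def response_label(rcbi_score: float) -> str:
    """Map a Residual Cancer Burden Index score to a binary response label.

    Follows the definition in the text: RCBI == 0 is pCR, RCBI > 0 is non-pCR.
    The function name and signature are hypothetical, not part of the dataset.
    """
    if rcbi_score < 0:
        raise ValueError(f"RCBI score cannot be negative: {rcbi_score}")
    return "pCR" if rcbi_score == 0 else "non-pCR"


# Example scores are made up for illustration only.
for score in (0.0, 1.4, 3.2):
    print(score, response_label(score))
```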

4. Tumor Collection and Histological Preparation: Pre-treatment breast tumor core biopsies were extracted according to the institution’s standard of care and following the College of American Pathologists (CAP) guidelines. Breast lesions were radiologically confirmed (mammogram, ultrasound or magnetic resonance imaging). Needle-core biopsies of the lesion were obtained using the TRU-CUT system with a 14-gauge needle. The extracted tissue samples measured 0.1 cm in diameter and 1.0 cm to 2.0 cm in length, depending on the size of the lesion. Tissue specimens were fixed in formalin and processed for sectioning. Tissue samples were embedded in paraffin, then cut into 4 µm (micron) sections. The sections were mounted on glass slides, then stained with hematoxylin and eosin (H&E).

5. Digital Pathology Samples: Tumor slides were imaged using a dedicated pathology imaging system from Huron Digital Pathology (St. Jacobs, Canada). Slides were scanned at 40x magnification; each whole-slide image (WSI) of the tumor core biopsy had a resolution of ~60,000 x 18,000 pixels (pixel size = 0.2 μm, i.e., subcellular resolution). All digital WSIs were color-calibrated using built-in system settings to correct for any light variances from the optical system.
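
The figures above imply some simple arithmetic that is useful when planning storage and tiling. The sketch below uses the approximate dimensions quoted in the text (about 60,000 x 18,000 pixels at 0.2 μm per pixel); the 512-pixel tile size and the assumption of uncompressed 8-bit RGB are illustrative choices, not properties stated for this dataset.

```python
import math

# Approximate values quoted in the text.
WIDTH_PX, HEIGHT_PX = 60_000, 18_000
PIXEL_SIZE_UM = 0.2

# Illustrative assumptions, not stated in the source.
TILE_PX = 512          # side length of a square, non-overlapping tile
BYTES_PER_PIXEL = 3    # uncompressed 8-bit RGB


def physical_size_mm(pixels, pixel_size_um=PIXEL_SIZE_UM):
    """Convert a pixel count to millimetres."""
    return pixels * pixel_size_um / 1000.0


def tile_grid(width_px, height_px, tile_px=TILE_PX):
    """Columns and rows of tiles needed to cover the image (edge tiles may be partial)."""
    return math.ceil(width_px / tile_px), math.ceil(height_px / tile_px)


def uncompressed_gigabytes(width_px, height_px, bytes_per_pixel=BYTES_PER_PIXEL):
    """In-memory size of the full image in gigabytes (10**9 bytes)."""
    return width_px * height_px * bytes_per_pixel / 1e9


cols, rows = tile_grid(WIDTH_PX, HEIGHT_PX)
print(f"physical size: {physical_size_mm(WIDTH_PX):.1f} mm x {physical_size_mm(HEIGHT_PX):.1f} mm")
print(f"tile grid: {cols} x {rows} = {cols * rows} tiles of {TILE_PX} px")
print(f"uncompressed RGB: {uncompressed_gigabytes(WIDTH_PX, HEIGHT_PX):.2f} GB")
```

Under these assumptions a slide spans roughly 12 mm x 3.6 mm and needs a 118 x 36 grid (4,248 tiles). Holding it uncompressed in memory would take about 3.24 GB, which is why reading one region at a time is usually preferable to loading the full image.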

6. Patient De-identification Process and Quality Check: All patient identifiers (name, postal code, date of birth, hospital file number) were removed from the samples through manual and computational methods. First, all slides with patient-related information affixed to the slide label were manually blinded before imaging. Second, the acquired images were reconstructed to isolate the tumor core images alone (discarding the slide background), and all metadata fields containing patient-related identifiers were removed. A non-proprietary Tagged Image File Format (.tif) file was then constructed and named according to the anonymized study accession number. The digital images were manually inspected by research staff to ensure quality, including the removal of blurriness, artifacts and physical deformities.


Data Description

The dataset is structured into 157 individual folders labeled “NAC-1” through “NAC-157,” each representing a unique patient case. Each folder contains two files:
  1. a whole-slide image file in Tagged Image File Format (.tif), comprising a digitized H&E-stained core biopsy (may contain multiple cores per patient) scanned at 40x magnification (0.2 μm/pixel), and
  2. a Microsoft Word document (.docx) that includes a structured table summarizing clinical data along with an embedded representative thumbnail of the corresponding biopsy image.
File names reflect the breast cancer subtype (e.g., “Luminal_A-1,” “Luminal_A-2,” etc.). Additionally, the dataset includes a centralized “Master_Data_Sheet” in Microsoft Excel (.xlsx) format, which compiles key clinical and pathological variables for all 157 patients in a single tabular file. All materials have been fully de-identified and curated for research use.
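
Given the layout described above, a local copy could be indexed with a few lines of Python. This is a sketch under stated assumptions: the root directory name is hypothetical, and the rule that the subtype is the part of the file name before the final hyphen is inferred from the “Luminal_A-1” example rather than documented.

```python
from pathlib import Path


def subtype_from_filename(filename: str) -> str:
    """Return the subtype prefix of a file name such as 'Luminal_A-1.tif'.

    Assumes the pattern '<subtype>-<number>.<ext>' suggested by the example
    in the text; other naming patterns are not handled.
    """
    return Path(filename).stem.rsplit("-", 1)[0]


def index_cases(root) -> dict:
    """Map each 'NAC-<n>' folder under `root` to its image and summary files."""
    cases = {}
    for folder in sorted(Path(root).iterdir()):
        if not (folder.is_dir() and folder.name.startswith("NAC-")):
            continue
        images = sorted(folder.glob("*.tif"))
        summaries = sorted(folder.glob("*.docx"))
        cases[folder.name] = {
            "wsi": images[0] if images else None,
            "summary": summaries[0] if summaries else None,
            "subtype": subtype_from_filename(images[0].name) if images else None,
        }
    return cases


# Hypothetical usage; replace the path with the location of a local copy.
# cases = index_cases("breast_pathology_dataset")
# print(len(cases), cases["NAC-1"]["subtype"])
```

Folders with a missing image or summary file are kept in the index with a value of None, so incomplete downloads can be spotted before any analysis is run.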

Usage Notes

This dataset was developed to support the advancement of artificial intelligence (AI) applications in digital pathology, with a focus on predicting response to neoadjuvant chemotherapy (NAC) in high-risk breast cancer. The .tif whole-slide images can be opened using digital pathology software such as Sedeen Viewer, OpenSlide, and QuPath. All files are stored in standard formats (.tif, .docx, .xlsx) to allow integration into a variety of research applications.
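
For programmatic access, one option is the OpenSlide Python bindings (the openslide-python package). The sketch below reads a single full-resolution tile; the file path, tile size, and helper names are hypothetical, and it assumes the .tif files use a tiled layout that OpenSlide can read.

```python
def open_slide(path):
    """Open a whole-slide image with OpenSlide.

    The import is kept inside the function so the helper below can be used
    without the third-party package (pip install openslide-python).
    """
    import openslide
    return openslide.OpenSlide(str(path))


def read_tile_rgb(slide, x: int, y: int, tile_px: int = 512):
    """Read one tile at full resolution (level 0) and return it as an RGB image.

    `slide` is an openslide.OpenSlide object (or anything with the same
    read_region method). read_region returns an RGBA PIL image, so the alpha
    channel is dropped with convert("RGB").
    """
    region = slide.read_region((x, y), 0, (tile_px, tile_px))
    return region.convert("RGB")


# Hypothetical usage; the path below is an example, not a guaranteed file name.
# slide = open_slide("NAC-1/Luminal_A-1.tif")
# print(slide.dimensions)            # (width, height) in pixels at level 0
# tile = read_tile_rgb(slide, 0, 0)  # top-left 512 x 512 tile
# tile.save("tile_0_0.png")
# slide.close()
```

Reading by region keeps memory use proportional to the tile size rather than to the full slide, which suits patch-based model training.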

Ethics

Co-Operation with Institutional REB: This study was approved by the institutional ethics review board (REB# 270-2018).

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

  1. P. Cortazar et al. (2014), “Pathological complete response and long term clinical benefit in breast cancer: the CTNeoBC pooled analysis,” The Lancet, vol. 384, no. 9938, pp. 164–172, doi: 10.1016/S0140-6736(13)62422-8.
  2. G. von Minckwitz et al. (2012), “Definition and impact of pathologic complete response on prognosis after neoadjuvant chemotherapy in various intrinsic breast cancer subtypes,” Journal of Clinical Oncology, vol. 30, no. 15, pp. 1796–1804, doi: 10.1200/JCO.2011.38.8595.
  3. W. F. Symmans et al. (2007), “Measurement of residual breast cancer burden to predict survival after neoadjuvant chemotherapy,” Journal of Clinical Oncology, vol. 25, no. 28, pp. 4414–4422, doi: 10.1200/JCO.2007.10.6823.

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
Health Data Nexus Contributor Review Health Data License 1.0

Data Use Agreement:
T-CAIREM Data Use Agreement

Required training:
TCPS 2: CORE 2022

Discovery

DOI (version 1.0.0):
https://doi.org/10.57764/pm3v-b131

DOI (latest version):
https://doi.org/10.57764/hprg-9p03
