Applications of deep-learning based Histopathological Analysis to predict Cancer Biomarkers
This article is a work in progress that will grow as I read more papers and learn about the various aspects of deep-learning-based cell microscopy and oncology. We will discuss the characteristics of, and previous work on, several problems in these fields where deep-learning-based approaches are applied.
Disclaimer: I don't have any degree or experience in medicine/physiology, and the information presented in this article is taken from the referenced sources without cross-validation. Thus, I don't guarantee any of the information, and readers should consult additional sources for accurate information. I would appreciate corrections of any misinformation in this article.
Table of contents
- Tumor Grade (Proliferation score)
- — Mitosis detection
- — TUPAC 16 Challenge
- Immune phenotypes
- PD-1/PD-L1 inhibitor
- Tumor-infiltrating lymphocytes (TIL) therapy
Important biomarkers for understanding the characteristics of a tumor are found by analyzing whole slide biopsies. However, analysis of WSIs is a very labor-intensive and often subjective task, even for expert pathologists. Computer vision can be applied to automatically predict biomarkers or to improve the consistency and speed of human analysis. Predicting biomarkers typically involves counting the occurrences of a certain type of cell or segmenting the slide image to analyze the proportion of some tissue type.
In deep-learning terms, the main task is semantic segmentation and detection (object detection or instance segmentation) of cell-level instances given a stained whole slide image.
Many important biomarkers for predicting the clinical outcomes of cancer treatments are evaluated by examining whole slide tissue biopsy specimens. However, the manual process of histopathological analysis is laborious, time-consuming, and limited by the quality of the specimen and the experience of the pathologist. Thus, a computer vision solution for automating this process has many advantages.
Detecting and segmenting nuclei and cells from high-resolution whole slide image (WSI) scans of stained tissue biopsy specimens is directly linked to how some cell-level biomarkers are assessed. Thus, cell detection (which involves segmentation in most cases) is a core problem in predicting biomarkers from biopsy specimens. There are many approaches to this problem, as well as slightly different definitions of it. Some works formulate the problem as instance segmentation of nuclei and propose solutions using semantic and instance segmentation methodologies, while another branch of work formulates it as boundary detection.
Deep learning largely benefits from large-scale training datasets. However, it has been suggested that the limited availability of supervised histopathological datasets can be a bottleneck. Among the publicly available supervised histopathological datasets described in the table above, there are cases where the annotations are not consistent: for example, some datasets were not exhaustively annotated or contain annotations where the borders of the nuclei are not outlined. Differences in staining methods and microscopy technology further complicate the data problem.
One unique characteristic of nucleus segmentation is that microscopy images often have colossal resolutions and contain a very large number of objects. For example, the test set used at Kaggle's 2018 Data Science Bowl (row 1 in the table above) has images containing approximately 100,000 nuclei, which is unimaginable in ordinary object detection or instance segmentation settings. Such a large throughput can pose computational bottlenecks and limit model complexity when working on the full image, but it also makes this an ideal task for deep learning, because it is incomparably more laborious for human pathologists.
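In practice, full-resolution WSIs rarely fit in GPU memory, so inference is usually run on overlapping tiles whose predictions are stitched back together. Below is a minimal sketch of this pattern; the `model` callable, patch size, and stride are hypothetical placeholders, not taken from any particular paper.

```python
import numpy as np

def sliding_window_inference(image, model, patch=512, stride=384):
    """Tile a large image, run a patch-level model, and average overlapping scores.

    image: (H, W, 3) array; model: callable mapping a (patch, patch, 3) tile
    to a (patch, patch) score map. Border regions smaller than a full tile
    are skipped here for brevity.
    """
    h, w = image.shape[:2]
    scores = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tile = image[y:y + patch, x:x + patch]
            scores[y:y + patch, x:x + patch] += model(tile)
            counts[y:y + patch, x:x + patch] += 1.0
    return scores / np.maximum(counts, 1.0)  # averaged per-pixel scores
```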
The Multi-Organ Nucleus Segmentation (MoNuSeg) dataset by Kumar et al. (row 6 in the table above) includes high-resolution WSIs of Hematoxylin and Eosin (H&E) stained slides from nine tissue types, digitized at eighteen different hospitals at 40× magnification. The dataset consists of densely populated sub-regions extracted from the WSIs.
Evaluation and Metrics
Nucleus segmentation is often evaluated using conventional metrics for evaluating semantic segmentation such as the Jaccard index (IoU/mIoU) and metrics for instance segmentation such as mAP.
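For reference, for a predicted mask P and a ground-truth mask G, these pixel-wise metrics are defined as

```latex
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}, \qquad
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}
```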
The authors of one evaluation study argue that researchers must prioritize object-level metrics over pixel-wise similarity. I believe this may not hold in every case, depending on the specific application we want to achieve after nucleus segmentation.
From a cell biology perspective, pixel-wise overlap alone is not sufficient to diagnose the errors that actually impact downstream analysis, such as missed and merged objects. The authors therefore recommend metrics that explicitly count correctly segmented objects as true positives and penalize any instance-level error, similar to practice in diagnostic applications; these include the object-level F1-score and false-positive counts, among others.
The authors also propose a quantitative metric for evaluating the performance of single-class nucleus detection algorithms. An object is considered detected (correct) if the IoU between the object and a ground-truth object is larger than a certain IoU threshold. Then, an object-level F1 score (equivalently, a Dice score) is calculated for a given IoU threshold. The final output is the average of the object-level F1 scores over thresholds from 0.50 up to 0.90 in increments of Δt = 0.05, which takes the quality of the segmentations into account.
The authors suggest that while this metric is essentially similar to the mAP metric widely used for evaluating object detection and instance segmentation, the F1-score is more convenient to adopt in problems that consider a single object category.
Additionally, the authors suggest measuring the counts of false negatives (missed objects); merges (under-segmentations), identified by several true objects being covered by a single estimated mask; and splits (over-segmentations), identified by a single true object being covered by multiple estimated masks. All of this can be efficiently calculated by computing a matrix C of size n×m which contains the IoU score between every pair of true and estimated objects (n: number of true objects, m: number of estimated objects).
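A minimal sketch of this evaluation, assuming ground-truth and predicted instances are given as integer-labeled maps (0 = background): it builds the n×m IoU matrix and averages the object-level F1 over the thresholds described above.

```python
import numpy as np

def iou_matrix(gt_labels, pred_labels):
    """C[i, j] = IoU between ground-truth object i+1 and predicted object j+1."""
    n, m = int(gt_labels.max()), int(pred_labels.max())
    C = np.zeros((n, m), dtype=np.float64)
    for i in range(1, n + 1):
        gt_i = gt_labels == i
        for j in range(1, m + 1):
            pred_j = pred_labels == j
            inter = np.logical_and(gt_i, pred_j).sum()
            if inter:
                C[i - 1, j - 1] = inter / np.logical_or(gt_i, pred_j).sum()
    return C

def object_level_f1(gt_labels, pred_labels, thresholds=np.arange(0.5, 0.95, 0.05)):
    """Average object-level F1 over IoU thresholds 0.50, 0.55, ..., 0.90."""
    C = iou_matrix(gt_labels, pred_labels)
    n, m = C.shape
    scores = []
    for t in thresholds:
        # At IoU > 0.5 a ground-truth object can match at most one prediction,
        # so counting row-wise maxima above the threshold gives the true positives.
        tp = int((C.max(axis=1) > t).sum()) if m else 0
        fn, fp = n - tp, m - tp
        scores.append(2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0)
    return float(np.mean(scores))
```

Merges and splits can be read off the same matrix: a predicted object whose column has several non-trivial overlaps covers multiple true objects (a merge), and a true object whose row has several such overlaps has been split across predictions.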
The metrics discussed here seem to be used in many research papers, and the ideas behind them are worth noting.
Hand-crafted approaches employ a series of hand-crafted preprocessing and algorithmic methods without gradient-based parameter training. They are extremely fast and easy to control compared to deep learning, and these tools have been researched and used for decades. They can serve as a robust baseline for comparing the performance of more complex learning-based methods.
For example, the FIJI Segmentation Editor (built on ImageJ) shows surprisingly good performance for cell segmentation. More information on this pipeline is available in the references below. There are also other open-source packages such as CellProfiler and Ilastik.
How to use FIJI for Cell segmentation: https://www.youtube.com/watch?v=82N-eIPqnwM
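As a concrete example of such a hand-crafted pipeline, a minimal Otsu-threshold-plus-watershed baseline using scikit-image might look like the following; the minimum peak distance and other parameters are hypothetical and dataset-dependent.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import filters, feature, segmentation

def classical_nucleus_segmentation(gray):
    """gray: 2-D image with bright nuclei on a dark background."""
    # 1. Global Otsu threshold to obtain a foreground mask.
    mask = gray > filters.threshold_otsu(gray)
    # 2. Distance transform + local maxima as watershed seeds.
    distance = ndi.distance_transform_edt(mask)
    peaks = feature.peak_local_max(distance, min_distance=7, labels=mask)
    markers = np.zeros(gray.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # 3. Watershed splits touching nuclei along ridges of the distance map.
    labels = segmentation.watershed(-distance, markers, mask=mask)
    return labels  # integer-labeled instance map
```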
However, such methods have limitations due to unavoidable design assumptions that do not always hold: for example, thresholding methods assume bimodal intensity distributions, and region growing expects clearly separable boundaries. Moreover, they are often not powerful enough to detect the subtle variations of biologically meaningful phenotypes that are more challenging to segment, which is increasingly needed for modern biological systems.
We will focus this survey on deep-learning-based approaches for segmentation, because they have demonstrated state-of-the-art performance by a large margin and seem to have more potential given the recent advances in deep learning and computer vision.
The characteristics of the U-Net pipeline are the encoder-decoder architecture and the skip connections in between, as illustrated in the figure above. The encoder-decoder architecture reduces the spatial size of the features and enables efficient computation with a larger number of channels. The skip-connection concatenations prevent information loss during feature downsampling. To learn more complex features using powerful CNN architectures, the encoder network can be replaced with modern CNNs such as DenseNet or EfficientNet.
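As a concrete illustration of that structure, here is a stripped-down, two-level PyTorch sketch; the channel widths and depth are arbitrary and this is not the original architecture, just the encoder-decoder-with-skips idea in code.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net: the encoder downsamples, the decoder upsamples and concatenates skips."""
    def __init__(self, in_ch=3, n_classes=1):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 (skip) + 64 (upsampled)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                 # skip connection 1
        e2 = self.enc2(self.pool(e1))     # skip connection 2
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)              # per-pixel logits
```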
Nowadays, the pipeline is regarded as a general framework for making pixel-wise predictions and is applied to a variety of problems such as image-to-image translation, image synthesis, pixel-wise discriminators, and more.
The authors of U-Net propose a network for semantic segmentation of biomedical images together with strong data augmentation to cope with situations where a large-scale dataset is unavailable. They note that deformation is the most common variation in tissue and therefore simulate realistic deformations as data augmentation.
To separate touching objects of the same class, the authors suggest a very interesting weighted-loss method that gives higher priority to the narrow borders between close objects. Specifically, a weight map like the one in figure (d) is generated using the formula below and multiplied with the pixel-wise loss.
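For reference, the border-weight map defined in the U-Net paper is

```latex
w(x) = w_c(x) + w_0 \cdot \exp\!\left( -\frac{\left( d_1(x) + d_2(x) \right)^2}{2\sigma^2} \right)
```

where w_c(x) is a class-balancing weight, d_1(x) and d_2(x) are the distances from pixel x to the border of the nearest and the second-nearest cell, and the paper sets w_0 = 10 and σ ≈ 5 pixels.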
There are also many recent works on modifications and further improvements of the U-Net pipeline. For example,
Long, F. (2020). Microscopy cell nuclei segmentation with enhanced U-Net. BMC bioinformatics, 21(1), 1–12.
Lagree, A., Mohebpour, M., Meti, N., Saednia, K., Lu, F. I., Slodkowska, E., … & Tran, W. T. (2021). A review and comparison of breast tumor cell nuclei segmentation performances using deep convolutional neural networks. Scientific Reports, 11(1), 1–11.
Semantic segmentation for instance segmentation (+ boundary detection)
Using the weight-map method proposed in U-Net with some post-processing, we can indirectly achieve instance segmentation using the U-Net pipeline trained with a semantic segmentation loss. There are also other approaches to separating objects of the same class. For example, DCAN predicts object contours alongside the segmentation and applies a post-processing step to separate touching instances.
Another strategy is to add a channel in the final layer of the segmentation model to predict the boundaries of the nuclei, which are either provided in the data or obtained by dilating the ground-truth segmentation labels. This is learned together with the original semantic segmentation objective. We can view this method as simplifying an instance segmentation problem into a 3-class semantic segmentation problem, which is considerably easier to learn.
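A minimal sketch of constructing such a 3-class target (background / interior / boundary) from an integer-labeled instance map, using the boundary-dilation variant described above; the boundary width is an arbitrary choice.

```python
import numpy as np
from skimage.segmentation import find_boundaries
from skimage.morphology import binary_dilation, disk

def three_class_target(instance_labels, boundary_width=2):
    """Return a map with 0 = background, 1 = nucleus interior, 2 = nucleus boundary."""
    boundary = find_boundaries(instance_labels, mode="thick")
    boundary = binary_dilation(boundary, disk(boundary_width))   # thicken the boundary class
    target = np.zeros_like(instance_labels, dtype=np.uint8)
    target[instance_labels > 0] = 1      # nucleus pixels
    target[boundary] = 2                 # boundary takes precedence over interior
    return target
```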
Ensembling Mask R-CNN and U-Net
Mask R-CNN is a two-stage object detection pipeline that is capable of instance segmentation. Among the many works that apply such instance segmentation frameworks, Vuola et al. compare U-Net and Mask R-CNN on nucleus segmentation and discuss their different strengths and weaknesses.
In particular, the authors suggest that Mask R-CNN is better than U-Net at detecting nuclei but worse at segmenting them accurately. They therefore propose an ensemble of U-Net and Mask R-CNN in which each model compensates for the weaknesses of the other. A gradient boosting model was trained to predict the intersection-over-union (IoU) between each model's predicted masks and the ground truth. At test time, the trained ensemble model estimates the IoU for both models' predictions, and the final segmentation map is built using non-maximum suppression (NMS).
For overlapping masks from U-Net and Mask R-CNN, the mask with the higher predicted IoU is used. Non-overlapping masks are added to the output if their predicted IoU is above a threshold. As illustrated in the extreme example presented above, this ensemble method can combine the strengths of both models in different situations.
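To make the selection step concrete, here is a rough sketch of the described logic; it is my own interpretation, not the authors' implementation, and both thresholds are hypothetical.

```python
import numpy as np

def select_masks(masks, predicted_ious, overlap_thresh=0.5, keep_thresh=0.3):
    """Greedy NMS-style selection over the pooled masks of both models.

    masks: list of boolean arrays; predicted_ious: ensemble-predicted IoU per mask.
    Masks are processed from highest to lowest predicted IoU; a mask is dropped
    if it overlaps an already-selected mask too much or scores below keep_thresh.
    """
    order = np.argsort(predicted_ious)[::-1]
    selected = []
    for idx in order:
        if predicted_ious[idx] < keep_thresh:
            break
        cand = masks[idx]
        overlaps = any(
            np.logical_and(cand, s).sum() / np.logical_or(cand, s).sum() > overlap_thresh
            for s in selected
        )
        if not overlaps:
            selected.append(cand)
    return selected
```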
The table below, which reports the performance of the three models on multiple metrics, is interesting to discuss. Comparing U-Net with Mask R-CNN, Mask R-CNN is significantly more accurate at detecting nuclei, as can be seen in the number of over-segmented and under-segmented samples, and thus has better precision and recall. However, the Dice score, which measures the quality of the segmentation mask, indicates that U-Net produces more accurate segmentations.
So while the ensemble model provides only a slight improvement in terms of mAP, it indeed combines the strengths of both models effectively. A more detailed analysis comparing the performance on each dataset is given in the paper.
More information on Mask R-CNN: My review on Two-stage object detection
Cancer classification can be used to select the best treatment for the specific type and characteristics of a tumor. When concretely classified, these categories can act as prognostic factors, which predict the outcome of the disease without treatment, or as predictive factors, which predict the response to some treatment. Defining the specific subgroup of a cancer enables each patient to be treated according to the best evidence available, in the sense of evidence-based medicine (EBM).
Classification of cancer type must be made and validated in a careful and rigorous manner in which confounding effects are minimized. Deep-learning-based computer vision algorithms have the potential to automate significant parts of the laborious classification pipeline for those factors that are usually assessed by analyzing the histological appearance of tissue.
Breast Cancer Classification
According to Wikipedia, breast cancer classification is the task of dividing breast cancers into categories according to certain aspects of the tumor. Classification is usually, but not always, based on the histological appearance of tissue in the tumor. Some major categories for defining the type of breast cancer are:
- histopathological type: Based on characteristics seen under light microscopy of biopsy specimens. Broadly classified into carcinoma (tumor) in situ or invasive carcinoma.
- the grade of the tumor: Based on the microscopic similarity of breast cancer cells to normal breast tissue; reflects how fast the tumor is likely to divide and spread
- the stage of the tumor: Describes how much cancer there is in the body and where it is located
- the expression of proteins and genes, such as hormone receptor (ER/PR) and HER2 status
Grade classification (Proliferation score)
Tumor grade is the description of a tumor based on how abnormal the tumor cells and the tumor tissue look under a microscope. It is an indicator of how quickly a tumor is likely to grow and spread.
In the Nottingham grading system, the tissue sample is classified into three grades: well differentiated (low grade), moderately differentiated (intermediate grade), and poorly differentiated (high grade). Three morphological features on H&E-stained slides are indicative for breast cancer grading:
- Mitotic count
- Tubule formation
- Nuclear pleomorphism
Each category is given 1 to 3 points, and the points are added up into a final score that determines the corresponding grade. The surgical pathology criteria of Stanford Medicine also consider these three categories for grading breast cancer. We will look at the details of each task and how they can be automated using deep-learning-based computer vision systems.
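For illustration, the point-to-grade mapping of the Nottingham system can be written as a trivial function, using the commonly cited cut-offs (total 3-5 → grade 1, 6-7 → grade 2, 8-9 → grade 3):

```python
def nottingham_grade(mitotic_count_score, tubule_formation_score, pleomorphism_score):
    """Each sub-score is 1-3; the total of 3-9 maps to grade 1 (low) through 3 (high)."""
    total = mitotic_count_score + tubule_formation_score + pleomorphism_score
    if total <= 5:
        return 1   # well differentiated (low grade)
    if total <= 7:
        return 2   # moderately differentiated (intermediate grade)
    return 3       # poorly differentiated (high grade)
```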
Additionally, there are other methodologies for measuring tumor proliferation speed and classifying tumor grade. For example, although each has some issues, the following can also be used to measure a proliferation score:
- Immunohistochemical staining for the Ki-67 protein
- The molecular, gene-expression-based PAM50 proliferation score
TODO: More researching should be done before explaining the details of such alternative approaches for measuring tumor proliferation.
Cancer Grading: Mitosis Detection
Mitotic count is measured as the number of mitotic figures within a certain number of HPF regions. Mitosis counting from WSIs is routinely performed by pathologists, but it is a tedious and subjective task with poor reproducibility: there are many HPFs in a single whole slide, and judging whether a cell is mitotic is very subjective due to the variation in the appearance of mitotic cells.
Researchers point out that it is hard to reach a consensus on mitotic count even among pathologists. For example, of the 311 cases scored by two human pathologists in the validation dataset of one study, there was only 78% agreement, which is quite low in comparison with conventional computer vision datasets. Developing an automated algorithm to detect and count mitoses would not only save time but could also improve the reliability of pathological diagnosis.
Mitosis has four phases called prophase, metaphase, anaphase and telophase. Note that cells in the telophase stage must be counted as one cell because they aren’t completely separated yet.
Some challenges of applying deep learning to this task are:
- The data is highly imbalanced, as the number of mitotic cells is far smaller than the number of non-mitotic cells (a sketch of a common mitigation follows this list).
- Some other cells (such as apoptotic cells and dense nuclei) have an appearance similar to mitotic cells.
- The data is weakly labeled in some cases, since consistent annotation is challenging on its own.
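For the class-imbalance point, one common mitigation is to oversample training patches that contain mitotic figures. A minimal PyTorch sketch; the dataset object and the 0/1 label convention are hypothetical.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(train_dataset, labels, batch_size=64):
    """labels: one int per patch, 1 = contains a mitotic figure, 0 = otherwise."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]     # rare mitotic patches receive larger weights
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(weights),
        replacement=True,                    # lets the rare class be drawn repeatedly
    )
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
```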
*According to Wikipedia, a high-power field (HPF) denotes the field of view under the maximum magnification power of a microscope. The HPF of different microscopes is often different; in scientific papers it often corresponds to a 400-fold magnification.
TODO: Mitosis Detection Dataset
TUmor Proliferation Assessment Challenge
TUPAC16 [link] is a challenge on automatically predicting tumor proliferation scores by considering relevant biomarkers. Assessment of tumor proliferation directly influences the clinical management of the patient.
The dataset consists of 500 training and 321 testing breast cancer histopathology WSIs and their corresponding proliferation scores. Note that a single WSI is very large (on the order of 1 GB), and the full training dataset is 490 GB in size. According to the official task description, the challenge consisted of two subtasks for automatically assessing tumor proliferation:
- Prediction of the proliferation score based on mitosis counting: predict the grade of the tumor by counting the number of mitoses, a method commonly used by human pathologists for assessing the proliferation score. The scores are labeled as 1, 2, or 3, making this a classification problem.
- Prediction of the proliferation score based on molecular data: following the PAM50 methodology, the proliferation score is calculated as the mean RNA expression of 11 proliferation-associated genes. The molecular proliferation score mostly correlates with the score based on mitosis counting but is a more objective metric. Annotations are given as a single real number, making this a regression problem. (todo: more research)
While the data above was given as selected HPF regions where tumor cells are dense, a practical deep-learning-based solution to mitosis counting should be able to perform inference on the whole WSI instead of on predefined tumor regions. Two auxiliary datasets were provided to enable inference on WSIs:
- ROI dataset: a dataset of three ROIs indicating where a pathologist might perform mitosis counting based on standard clinical guidelines, typically regions that have high cellularity and are preferably located at the periphery. There is also further research on cellularity estimation using deep learning.
- Mitosis detection dataset: a standard dataset of annotated mitotic cells from selected HPFs.
Overview of submitted approaches
According to the challenge organizers, the submitted methods were mostly based on deep CNNs and were categorized into two approaches:
- Train a mitosis detector and assign the proliferation score based on the mitosis count, similar to a manual analysis.
- More direct approach of estimating the proliferation score based on the overall appearance of the ROI.
I thought it could be interesting to inspect, e.g., CAM maps of the direct approaches. Perhaps AI systems could suggest other features that correlate with cancer grade, apart from mitosis counting.
The tables above summarize the approaches of 12 entries that participated in the challenge. Further description of each method can be found in the respective papers.
TODO: DeepMitosis, deep learning for mitosis detection…
Cancer Grading: Tubule formation
Cancer Grading: Nuclear pleomorphism
The stage of a cancer is identified by the size of the tumor and how far it has spread from where it originated, and it assists in making appropriate treatment choices. Staging information obtained prior to surgery, for example by mammography, X-rays, and CT scans, is called clinical staging; staging by surgery is known as pathological staging, which is more complex but more accurate than the former.
TNM staging is a widely recognized system for breast cancer staging. It first identifies the T, N, and M factors (T for tumor, N for nodes, M for metastasis) and then groups these factors into overall stages. A short description:
- Primary Tumor (T): Depends on the size and characteristics of the cancer at the primary site of origin in the breast
- Regional Lymph Nodes (N): Depends on the number, size, and location of cancer metastases in regional lymph nodes
- Distant Metastases (M): Whether the tumor has metastasized to other, distant parts of the body
Precise definitions are provided in the linked source. The final stage of cancer has a direct connection with the prognosis of the patient and is determined by a combination of these metrics.
N: CAMELYON 17 Challenge
- CAMELYON 16: https://camelyon16.grand-challenge.org
- CAMELYON 17: https://camelyon17.grand-challenge.org
The CAncer MEtastases in LYmph nOdes challeNge (CAMELYON) 17 mimics the clinical practice of detecting breast cancer metastases from multiple WSIs of sentinel lymph nodes, which determines the N factor in the TNM system for breast cancer staging.
The data contains a total of 1000 H&E or IHC stained WSIs collected from 5 medical centers. The WSIs vary in terms of the scanner used and the magnification level. Each WSI was labeled as macro-metastasis, micro-metastasis, or isolated tumor cells (ITC) based on the size of the largest lesion present in the slide, or as negative when no metastasis was present. Patient-level pN-stage labels were also given, based on several slides per lymph node. To create a more uniform distribution of pN-stages, the challenge organizers created artificial patients by grouping 5 WSIs from different patients.
While the submissions are graded according to the difference between predicted and ground-truth pN-stages on the test set of 100 patients, 50 WSIs were also exhaustively annotated by carefully outlining each lesion with polygons, as in the figure above.
The distribution of pN-stages and WSI labels is shown in the table below.
The top twelve teams followed the same basic algorithmic steps: preprocessing, slide-level classification, slide-level post-processing, and patient-level classification. Details on each method are provided in the challenge paper.
The preprocessing stage identifies tissue regions in the WSIs, usually using Otsu's automatic threshold on different color representations (e.g., RGB, HSV, HSI). Some teams further refine the tissue mask using methods such as the following (a minimal sketch follows the list):
- median filter to remove small regions
- connected component analysis and size filtering
- morphological hole-filling
- TODO more research on such morphological operations
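For illustration, a minimal tissue-masking sketch on a low-resolution thumbnail, using Otsu's threshold on the HSV saturation channel followed by simple morphological cleanup; the parameter values are arbitrary.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import color, filters, morphology

def tissue_mask(thumbnail_rgb, min_object_px=500):
    """thumbnail_rgb: low-resolution RGB view of the WSI."""
    saturation = color.rgb2hsv(thumbnail_rgb)[..., 1]        # tissue is more saturated than glass
    mask = saturation > filters.threshold_otsu(saturation)   # Otsu threshold on saturation
    mask = morphology.remove_small_objects(mask, min_size=min_object_px)  # size filtering
    mask = ndi.binary_fill_holes(mask)                        # morphological hole-filling
    return mask
```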
All the participants used common CNN architectures as the backbone network for classification on the WSIs. Team 2 used a framework similar to the DeepLab models, employing dilated convolutions and conditional random fields (CRF) in their pipeline. Slightly different data augmentation methods were also proposed.
Actual metastasis candidates are obtained by thresholding the likelihood maps and post-processing them. One post-processing strategy is to remove small isolated detections, which are often false positives.
Assigning slide-level labels directly from the predicted metastasis likelihood map is not good practice, because such methods were observed to have high false-positive rates in the earlier CAMELYON 16 challenge. Instead, a simple classifier (e.g., a random forest) is used to predict slide-level labels from features systematically extracted from the binary detection mask and the likelihood map. Typical features include the number of detected metastases, the mean detection size and its standard deviation, and the mean detection likelihood and its standard deviation. Finally, the patient-level pN-stages were mostly determined according to the clinical rules for defining the cancer stage, although some teams implemented more sophisticated learning-based algorithms to predict the stage.
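For illustration, a sketch of this second stage; the feature set and hyperparameters are illustrative, not those of any particular team.

```python
import numpy as np
from skimage import measure
from sklearn.ensemble import RandomForestClassifier

def slide_features(likelihood_map, threshold=0.5):
    """Summarize a metastasis likelihood map into a fixed-length feature vector."""
    detection_mask = likelihood_map > threshold
    regions = measure.regionprops(measure.label(detection_mask))
    areas = np.array([r.area for r in regions]) if regions else np.zeros(1)
    likelihoods = likelihood_map[detection_mask] if detection_mask.any() else np.zeros(1)
    return np.array([
        len(regions),                  # number of detected metastasis candidates
        areas.mean(), areas.std(),     # mean detection size and its spread
        likelihoods.mean(), likelihoods.std(),
    ])

clf = RandomForestClassifier(n_estimators=200)   # simple slide-level classifier
# Hypothetical usage, given training likelihood maps and slide labels:
# X = np.stack([slide_features(m) for m in train_likelihood_maps])
# clf.fit(X, train_slide_labels)
# label = clf.predict(slide_features(test_likelihood_map).reshape(1, -1))
```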
Immunotherapy is the treatment of disease by activating (activation immunotherapies) or suppressing (suppression immunotherapies) the immune system. In particular, it has demonstrated clinical success in treating various cancers.
Cancers can be characterized by three distinct immune phenotypes that describe the level of T-cell activity in a region of the tumor. They are categorized as:
- Immune desert: lack of pre-existing immunity
- Immune excluded: some pre-existing immunity, T-cells are at the periphery of tumors
- Inflamed: pre-existing immunity is present at the tumor site
These immune phenotypes provide a foundation for tailoring the right combination of approaches to the right patient. For example,
- PD-1/PD-L1 inhibitors are more likely to produce a response in inflamed tumors, because T-cells can infiltrate the tumor but are simply not functioning properly.
- More examples are described in the resource by Genentech
Cancer immunity cycle
Checkpoint inhibitor, PD-1/PD-L1
Immune checkpoints are proteins that dampen the response of the immune system. Checkpoint inhibitor therapy tries to restore immune system function by blocking such inhibitory checkpoints. Approved checkpoint inhibitors target three molecules: CTLA-4, PD-1, and PD-L1.
PD-L1 on tumor cells interacts with PD-1 on T-cells to prevent the immune system from attacking the tumor. PD-1 and PD-L1 inhibitors block the association of these two proteins. According to Wikipedia, this treatment shrinks tumors in a higher proportion of patients across a wider range of tumor types, is associated with lower toxicity than other immunotherapies, and produces durable responses. PD-L1 levels have been found to be highly predictive of whether patients will respond to PD-1/PD-L1 inhibitors.
Predicting PD-1/PD-L1 inhibitor prognosis by PD-L1 expression
An important biomarker for predicting PD-1/PD-L1 inhibitor prognosis is the Tumor Proportion Score (TPS): the percentage of viable tumor cells showing partial or complete membrane staining. The sample is considered to have PD-L1 expression if TPS > 1%, and high PD-L1 expression if TPS > 50%.
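Given per-cell predictions from a detection system, TPS and the corresponding expression categories reduce to a simple count; a minimal sketch, with the thresholds taken from the definition above:

```python
def tumor_proportion_score(n_stained_tumor_cells, n_viable_tumor_cells):
    """TPS = percentage of viable tumor cells with membrane PD-L1 staining."""
    tps = 100.0 * n_stained_tumor_cells / max(n_viable_tumor_cells, 1)
    if tps > 50:
        category = "high PD-L1 expression"
    elif tps > 1:
        category = "PD-L1 expression"
    else:
        category = "no PD-L1 expression"
    return tps, category
```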
How it is measured:
PD-L1 IHC 22C3 pharmDx, an FDA-approved product for the detection of PD-L1 protein in FFPE non-small cell lung cancer and urothelial carcinoma specimens: https://www.agilent.com/ko-kr/products/pharmdx/pd-l1-ihc-22c3-pharmdx-testing
Lunit SCOPE PD-L1 product using AI-based detection: https://demo.scope.lunit.io/pdl1/
TIL, T-cell adoptive transfer (TIL therapy)
Tumor-infiltrating lymphocytes (TIL) are white blood cells (T cells and B cells) that have migrated into a tumor. They are found either within the tumor or in the tumor stroma. They are implicated in killing tumor cells and are thus associated with better outcomes of surgery or immunotherapy.
TIL therapy is a type of immunotherapy. In short, the procedure involves surgically resecting TILs from the tumor and expanding them; the best TIL lines are selected, further expanded, and infused back into the patient.
AI-based computer vision systems can be applied to estimate the TIL density in the tumor as a biomarker for predicting the outcome of TIL therapy. One example is the Lunit SCOPE IO product, which counts the number of lymphoplasma cells within an HPF.
 Lagree, A., Mohebpour, M., Meti, N., Saednia, K., Lu, F. I., Slodkowska, E., … & Tran, W. T. (2021). A review and comparison of breast tumor cell nuclei segmentation performances using deep convolutional neural networks. Scientific Reports, 11(1), 1–11.
 N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane and A. Sethi, “A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology,” in IEEE Transactions on Medical Imaging, vol. 36, no. 7, pp. 1550–1560, July 2017, doi: 10.1109/TMI.2017.2677499.
 Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234–241). Springer, Cham.
 Vuola, A. O., Akram, S. U., & Kannala, J. (2019, April). Mask-RCNN and U-net ensembled for nuclei segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (pp. 208–212). IEEE.
 He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).
 Caicedo, J. C., Goodman, A., Karhohs, K. W., Cimini, B. A., Ackerman, J., Haghighi, M., … & Carpenter, A. E. (2019). Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nature methods, 16(12), 1247–1253.
 Chen, H., Qi, X., Yu, L., Dou, Q., Qin, J., & Heng, P. A. (2017). DCAN: Deep contour-aware networks for object instance segmentation from histology images. Medical Image Analysis, 36, 135–146.
 Caicedo, J. C., Roth, J., Goodman, A., Becker, T., Karhohs, K. W., Broisin, M., … & Carpenter, A. E. (2019). Evaluation of deep learning strategies for nucleus segmentation in fluorescence images. Cytometry Part A, 95(9), 952–965.
 Li, C., Wang, X., Liu, W., & Latecki, L. J. (2018). DeepMitosis: Mitosis detection via deep detection, verification and segmentation networks. Medical image analysis, 45, 121–133.
 Elston, C. W., & Ellis, I. O. (1991). Pathological prognostic factors in breast cancer. i. the value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology, 19, 403–410.
 Veta, M., Heng, Y. J., Stathonikos, N., Bejnordi, B. E., Beca, F., Wollmann, T., … & Pluim, J. P. (2019). Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge. Medical image analysis, 54, 111–121.
 Nielsen, T. O., Parker, J. S., Leung, S., Voduc, D., Ebbert, M., Vickery, T., … & Ellis, M. J. (2010). A comparison of PAM50 intrinsic subtyping with immunohistochemistry and clinical prognostic factors in tamoxifen-treated estrogen receptor–positive breast cancer. Clinical cancer research, 16(21), 5222–5232.
 Lester, S.C., Bose, S., Chen, Y.-Y., Connolly, J.L., de Baca, M.E., Fitzgibbons, P.L., Hayes, D.F., Kleer, C., O’Malley, F.P., Page, D.L., Smith, B.L., Tan, L.K., Weaver, D.L., Winer, E., Members of the Cancer Committee, College of American Pathologists, 2009. Protocol for the examination of specimens from patients with invasive carcinoma of the breast. Arch. Pathol. Lab. Med. 133, 1515–1538. https://doi.org/10.1043/1543-2165-133.10.1515
 Akbar, S., Peikari, M., Salama, S., Panah, A. Y., Nofech-Mozes, S., & Martel, A. L. (2019). Automated and manual quantification of tumour cellularity in digital slides for tumour burden assessment. Scientific reports, 9(1), 1–9.
 Macenko, M., Niethammer, M., Marron, J. S., Borland, D., Woosley, J. T., Guan, X., Schmitt, C., & Thomas, N. E. (2009). A method for normalizing histology slides for quantitative analysis. In International Symposium on Biomedical Imaging (ISBI) (pp. 1107–1110).
 Yi, M., Jiao, D., Xu, H., Liu, Q., Zhao, W., Han, X., & Wu, K. (2018). Biomarkers for predicting efficacy of PD-1/PD-L1 inhibitors. Molecular cancer, 17(1), 1–14.
 Bandi, P., Geessink, O., Manson, Q., Van Dijk, M., Balkenhol, M., Hermsen, M., … & Litjens, G. (2018). From detection of individual metastases to classification of lymph node status at the patient level: the camelyon17 challenge. IEEE transactions on medical imaging, 38(2), 550–560.