A Review on Deep Learning Methods for Chest X-Ray based Abnormality Detection and Thoracic Pathology Classification

Backed by more powerful computational resources and optimized training routines, Deep Learning models have achieved unprecedented performance in extracting information from chest X-ray data. The chest X-ray is one of the most common imaging exams, and its increasing demand is reflected in radiologists' growing workload. Consequently, healthcare would benefit from computer-aided diagnosis systems to prioritize certain exams and further identify possible pathologies. Pioneering work in chest X-ray analysis has focused on the identification of specific diseases, but to the best of the authors' knowledge no paper has specifically reviewed relevant work on abnormality detection and multi-label thoracic pathology classification. This paper focuses on those issues, selecting the leading chest X-ray based deep learning strategies for comparison. In addition, the paper describes the current annotated public chest X-ray databases, covering the common thorax diseases.


Introduction
Among the popular medical imaging exams, the Chest X-Ray (CXR) is frequently requested by healthcare professionals to assess the presence of thoracic diseases, due to its low-cost and noninvasive nature. Nevertheless, the thorough analysis of CXR images is time-consuming and their interpretation may be uncertain even for expert radiologists (Shaw, Hendry, and Eden 1990). For this reason, incorporating computer-aided diagnosis systems in hospitals is an attractive way to increase productivity and efficiency in the interpretation of these exams, by providing a second opinion. Considering the recurring need to assess several types of thoracic pathologies, Deep Learning (DL) based systems have been preferred over traditional machine learning approaches, following the advances in computational capabilities and the increasing availability of medical datasets. The data-driven nature of DL has achieved strong performance in multi-disease detection and classification tasks, making it a useful preliminary diagnostic tool that reduces the physicians' workload. A CXR-based computer-aided diagnosis system encompasses several steps to reach a diagnosis, and perhaps one of the most important is abnormality detection, focused on the prioritization of more urgent abnormal cases. As mentioned in Yasaka and Abe (2018), this would be highly valuable for clinicians to manage their time and resources, considering that cardiothoracic and pulmonary abnormalities are among the leading causes of morbidity and mortality, according to Wang et al. (2016). This step could be further complemented with another important task, which is the identification of the pathologies present in the exam at hand. Here, one must consider a multi-label thoracic pathology classification approach, bearing in mind that more than one pathology may be present in the same image/patient.
The detection and classification of cardiothoracic and pulmonary abnormalities often resort to Convolutional Neural Networks (CNNs), due to their ability to handle data with strong spatial relationships. In fact, these networks have matched or even exceeded human performance in other medical tasks, namely diabetic retinopathy detection (Ting et al. 2017) and skin cancer classification (Esteva et al. 2017). Yet, to the best of the authors' knowledge, no paper provides a state-of-the-art review based exclusively on deep learning approaches to both the thoracic abnormality detection and classification tasks. For this reason, the present publication intends to gather and describe all public annotated CXR databases and analyse how they have been used in the most relevant papers to tackle abnormality detection and pathology classification. Accordingly, certain criteria were defined to select the papers from arXiv, IEEE Xplore, PubMed, and Scopus: the work had to employ a fully DL based methodology, be published after 2015 to ensure novelty, extract information exclusively from images (and not radiology reports), and present a relevant study in the field. It was also established that the comparison between the selected papers would be based on the most frequently observed evaluation metric, the Area Under the Curve (AUC). In summary, besides this introduction, the paper includes a review of the CXR datasets in Section 2, abnormality detection in Section 3, and multi-label thoracic pathology classification in Section 4. Finally, Section 5 presents the main conclusions.

Chest X-ray Datasets
Although DL approaches have proven to significantly improve the performance of computer-aided diagnosis systems, their distinctive data-hungry nature limits further achievements. In fact, the achievements of the recent past were only enabled by the publication of larger public CXR datasets. For this reason, it is still ambitious to say that these systems will soon have a truly large-scale, high-precision implementation in a real-life clinical domain, considering the challenges tied to the collection and annotation of CXR datasets. For example, Shin, Lu, and Summers (2017) state that it is not clear how to annotate the large number of CXR images needed for DL methods, particularly while ensuring the required precision. Besides, there are multiple ways to define the labels themselves, or the criteria to follow during the annotation process. In spite of all these difficulties, several CXR datasets have been published, which can be split into two main groups: the ones which tackle a specific thoracic pathology, and the ones which annotate multiple pathologies. While this paper briefly describes the first group, its focus is on the datasets which encompass more than a single pathology.
In the first group, the JSRT dataset presented in Shiraishi et al. (2000) contains 247 frontal CXR images with and without lung nodules, from 14 medical centers, being one of the first available collections. Jaeger et al. (2014) provided two datasets centered on tuberculosis, named the MC and Shenzhen sets. These were collected in the United States and Shenzhen, and contain 138 and 662 frontal CXR images respectively, presenting both normal and tuberculosis cases. Additionally, Ryoo and Kim (2014) introduced a total of 10 848 observations from the Korean Institute of Tuberculosis (KIT).
Considering CXR datasets which tackle several pathologies, Gohagan et al. (2000) proposed the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial, which resulted in a 13-label partially public set of 185 421 CXR images from 56 071 patients. The Open-I Indiana University dataset was published in Demner-Fushman et al. (2016), collecting 8 121 associated frontal images from two large hospitals in the Indiana Network for Patient Care, and addressing the 10 most prevalent conditions observed in 3 996 subjects. Later on, the National Institutes of Health (NIH) released ChestX-ray8 in Wang et al. (2017a), compiling 108 948 frontal views belonging to 32 717 unique patients and a total of 8 associated pathologies (Figure 1) extracted from radiological reports using natural language processing. This dataset evolved to include 6 more categories, increasing the overall number of frontal CXR images and resulting in ChestX-ray14. It is argued that this version is more representative of the patients' distributions and diagnosis in comparison to the previously mentioned set (Wang et al. 2017b). ChestX-ray14 thus comprises a total of 112 120 images from 30 805 patients and 14 pathologies, and is by far the most popular dataset in today's research. Another staple among the most popular CXR datasets is CheXpert, presented in Irvin et al. (2019), with 224 316 frontal and lateral images of 65 240 patients from the Stanford Hospital. CheXpert is distinctive because it recognizes not only the presence of 12 pathology-related classes, but also the presence of medical support devices and fractures, all described in radiology reports that were released along with the images. Finally, PadChest was made available more recently in Bustos et al. (2020), containing a total of 193 labels applied to 160 868 frontal and lateral CXR images of 67 625 patients. It was collected from the San Juan Hospital, considering radiology reports written in Spanish.
Table 1 focuses on the multiple pathology datasets presented above and summarizes their content. As will be made clear in the following sections, ChestX-ray14 is undoubtedly the benchmark dataset for CXR-based computer-aided diagnosis. For this reason, there are some important considerations to be made about its limitations. Firstly, the ChestX-ray14 labels were text-mined from radiology reports through natural language processing techniques, and no expert validation was performed to confirm whether the final annotations match the image content. Instead, the annotations were validated using the Open-I Indiana University dataset (F1 score of 0.90). This raises some questions regarding the accuracy of the annotations, namely how accurately these labels reflect the pathology(ies) present in each image. Without manual, expert-based verification, there is no guarantee that the positive predictive values of the text-mined ground truth match those one would achieve from the visual cues. In addition, the established labels are not detailed, in the sense that they do not provide information on the expected range of abnormalities besides those 14 pathologies (e.g. pacemakers and invasive lines), and the "no finding" label does not guarantee a healthy observation; it simply indicates the absence of those 14 diseases (Yates, Yates, and Harvey 2018). Other issues can be raised, such as the class imbalance among pathologies, or the relevance between the CXR images and some of the proposed annotations.
To conclude, and while the overall lack of diversity impairs the ChestX-ray14's generalization ability in heterogeneous real-world settings, this dataset has fuelled innovation and research and is considered highly valuable.
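To make the labelling caveats above concrete, the following minimal Python sketch shows how multi-label annotations could be turned into multi-hot vectors, a binary abnormality flag, and per-class prevalence. It assumes that the labels of each image arrive as a single pipe-separated string, with "No Finding" marking the absence of the 14 pathologies; the helper names (`encode`, `is_abnormal`, `prevalence`) and the toy records are hypothetical and are not taken from any of the cited works.

```python
# The 14 pathology labels of ChestX-ray14.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def encode(label_string):
    """Turn e.g. 'Effusion|Mass' into a 14-element multi-hot list.

    Assumption: labels are pipe-separated; any name outside PATHOLOGIES
    (including 'No Finding') simply sets no bit.
    """
    present = set(label_string.split("|"))
    return [1 if name in present else 0 for name in PATHOLOGIES]


def is_abnormal(multi_hot):
    """Binary flag: 1 if at least one of the 14 pathologies is annotated.

    A value of 0 only means that none of the 14 labels is present;
    it does not guarantee a healthy image.
    """
    return int(any(multi_hot))


def prevalence(records):
    """Fraction of images annotated with each pathology (class imbalance)."""
    counts = [0] * len(PATHOLOGIES)
    for record in records:
        for i, bit in enumerate(encode(record)):
            counts[i] += bit
    return {name: c / len(records) for name, c in zip(PATHOLOGIES, counts)}


# Toy, hypothetical annotations for four images.
records = ["No Finding", "Effusion|Mass", "Effusion", "No Finding"]
rates = prevalence(records)
print(rates["Effusion"], rates["Mass"], rates["Hernia"])
```

Inspecting the prevalence dictionary on a real annotation file would expose the class imbalance mentioned above, since rare labels end up with rates close to zero.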

Abnormality Detection
Published work in this field has typically favored pathology classification rather than abnormality detection; yet, this detection task can have a high impact when it comes to building a triage system for the CXR images being analyzed. Several approaches can be established to define the automated triage criteria of the patients' images, i.e. which labels to consider. While Tataru et al. (2017) suggest a more elaborate three-label system (normal, abnormal, and emergent), the most common annotations are simply normal and abnormal. It is also possible to address the detection of a specific pathology, such as tuberculosis (Sivaramakrishnan et al. 2018), pneumonia (Chouhan et al. 2020), or cardiomegaly (Islam et al. 2017), in which case the abnormal label stands for the presence of the considered condition. However, in this section only the generic normal and abnormal annotations will be considered, thus assuming a binary classification exercise.
Current standard off-the-shelf CNN-based methods are frequently applied to detect abnormalities in CXR, and there are several papers which establish a comparison between well-known architectures, as illustrated in Tang et al. (2020) and Dunnmon et al. (2019). In the first work, the authors consider the AlexNet (Krizhevsky, Sutskever, and Hinton 2017), VGG (Simonyan and Zisserman 2015), GoogLeNet (Szegedy et al. 2014), ResNet (He et al. 2015) and DenseNet, and the ChestX-ray14 dataset. Using transfer learning with pre-trained ImageNet weights (Deng et al. 2009), all CNNs achieved good results, with the DenseNet slightly outperforming the remaining methods. Regarding Dunnmon et al. (2019), which exploits a private database but also uses ImageNet weights, only the AlexNet, ResNet, and DenseNet were assessed for the automated binary triage, where the DenseNet surpassed the other networks. Yates, Yates, and Harvey (2018) used transfer learning on the Inception CNN (Szegedy et al. 2014), retraining its final layer to execute abnormality detection on a mix of frontal CXR data from the ChestX-ray14 and Open-I Indiana datasets. Besides using transfer learning to reduce the needed computational resources, the authors advise skipping data augmentation, arguing that it is unlikely to result in a reliable representation of any collected real datasets. For the data, they gathered the normal Open-I Indiana CXR images (which, unlike the "no finding" images in ChestX-ray14, guarantee normality) and the examples of the 14 pathologies from ChestX-ray14 as abnormality-positive samples. The CXNet-m1 is presented in Xu, Wu, and Bie (2019) and has a reduced number of convolutional layers in comparison to VGG, ResNet, and DenseNet. Unlike the previous publications, the authors argue against transfer learning in this context due to the dissimilarity between medical images and those in ImageNet. Instead, they suggest that ChestX-ray14 is large enough to train a smaller CNN from scratch without time or memory limitations, proposing a shorter hierarchical CNN structure with an improved loss function (sin-loss) to address the information present in indistinguishable features and misclassified images.
All these methodologies treat the task at hand as a binary classification problem, but there are alternative ways to approach abnormality detection. One of them is to consider it a one-class exercise, where the goal is to recognize a specific category of data amongst all observations, by primarily learning from a training set containing only objects of that class. Tang et al. (2019) adopt this research line and suggest an end-to-end architecture for abnormality detection using generative adversarial one-class learning and ChestX-ray14 (Figure 2). Accordingly, the network takes only normal CXR images as input during training, which go through three main modules: a U-Net autoencoder (Ronneberger, Fischer, and Brox 2015), a CNN discriminator, and an encoder, which compete during the learning task while collaborating for the target task. Since the model is trained exclusively on normal observations, the adversarial generative model is able to reconstruct a normal CXR but performs poorly on an abnormal image, thus gaining the ability to distinguish both situations based on the difference in reconstruction quality. A one-class autoencoder-based approach is also implemented in Mao et al. (2020), taking normal samples and outputting the reconstructed normal version of the images with an associated pixel-wise uncertainty. Abnormal observations in ChestX-ray14 can then be identified by using the uncertainty-weighted reconstruction error as a measure of abnormality. Both these publications are valuable in cases where annotating all abnormalities is impractical for large-scale training or where such annotations cannot be obtained (e.g. rare forms of abnormality that are difficult to collect). While Shvetsova et al. (2020) agree that the autoencoders' implicit modelling of more complex data distributions is well suited to medical abnormality detection, the authors suggest softening the one-class assumption. In other words, the authors move away from a fully unsupervised detection, in which no abnormal observations are taken into account during the model's training, and instead use a limited subset of abnormal images for hyperparameter search, granting the model a more flexible understanding of normality. Consequently, the deep perceptual autoencoder is capable of learning common patterns between normal observations and so accurately restoring them, using the perceptual loss function to measure pattern dissimilarity. This works by minimizing the difference between the normalized features of the original and reconstructed images. Also evaluated on ChestX-ray14, the overall framework is represented in Figure 3.
Finally, a different approach to this task is proposed in Kieu et al. (2018), in which a private dataset goes through three different CNNs simultaneously (Multi-CNNs), represented in Figure 4. One of the networks takes the full CXR image, while the other two take either the left or right half of the same image, to ensure both sides are equally analysed. They all output the probability of normality and abnormality, which are then combined in a fusion rule to compute the final decision.
Shvetsova et al. (2020) propose to train an autoencoder with a limited number of abnormal samples. Note that these results correspond to the best detection experiment in each paper and cannot be directly compared, as they may consider different databases or subsets of the same database. Further information on the data splits for validation and testing of the models can be found in the original publications.
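The one-class scoring idea described in this section, where an image is flagged as abnormal when a model trained on normal data fails to reconstruct it, can be sketched in a few lines of Python. The sketch assumes that an autoencoder has already produced, for every pixel, a reconstructed value and a predicted variance; the score used here (the mean squared error divided by the variance) is one plausible form of an uncertainty-weighted reconstruction error and is not claimed to be the exact formulation of Mao et al. (2020). The function names, the toy pixel values, and the threshold are hypothetical.

```python
def anomaly_score(image, reconstruction, variance, eps=1e-6):
    """Mean squared reconstruction error, down-weighted where the model
    reports high pixel-wise uncertainty (large predicted variance).

    All three arguments are flat lists of equal length (one value per pixel).
    """
    total = 0.0
    for x, r, v in zip(image, reconstruction, variance):
        total += (x - r) ** 2 / (v + eps)
    return total / len(image)


def is_abnormal(score, threshold):
    """Flag an image as abnormal when its score exceeds a chosen threshold.

    Assumption: the threshold would be tuned on held-out data.
    """
    return score > threshold


# Toy example with four pixels. The model is uncertain about the third pixel
# (variance 0.04), so a small error there is tolerated.
reconstruction = [0.2, 0.5, 0.7, 0.5]
variance = [0.01, 0.01, 0.04, 0.01]

normal_image = [0.2, 0.5, 0.8, 0.5]    # matches the reconstruction closely
abnormal_image = [0.2, 0.9, 0.8, 0.5]  # large residual at a confident pixel

score_normal = anomaly_score(normal_image, reconstruction, variance)
score_abnormal = anomaly_score(abnormal_image, reconstruction, variance)
print(score_normal, score_abnormal)
```

Dividing by the variance means that an error in a region the model considers hard to reconstruct contributes less to the score than the same error in a region it reconstructs confidently.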

Multi-label Thoracic Pathology Classification
The automatic identification of multiple pathologies in CXR is a much more common exercise than general abnormality detection. Consequently, there is a higher number of published articles with this particular aim, which often combine the classification with a localization task. In such cases, the goal is not only to identify the pathologies present in the image, but also where they appear to be. While most papers presented in this section combine the two aspects, the focus of analysis will be the methodology and performance of their classification task. Nonetheless, it is still relevant to note that several articles seek to interpret their results with heat maps (frequently obtained with class activation mapping) to highlight class-specific regions of images and better demonstrate what the network considered relevant for pathology identification. Additionally, it is also common practice to use transfer learning with pre-trained ImageNet weights to speed up the convergence of the classification models. All mentioned papers follow this procedure, unless stated otherwise. As previously introduced, the work presented in this section tackles a multi-label classification exercise, meaning multiple pathologies can be identified in the same image.
Perhaps one of the most popular publications for this purpose is CheXNet (Rajpurkar et al. 2017), which is a classical example of a simple DenseNet implementation. Urinbayev et al. (2020) follow a similar approach to CheXNet, incorporating it in a more comprehensive end-to-end diagnosis framework, and claiming to outperform the state-of-the-art by using a more robust version of the Adam optimizer, known as RAdam. This is a variation that provides an automated, dynamic adjustment to the adaptive learning rate. Furthermore, Kumar, Grewal, and Srivastava (2018) apply the DenseNet in a boosted cascaded context without any transfer learning. The authors argue it is able to model complex dependencies among class labels, whilst taking advantage of the boosting strategy during training compared to single classifiers. Gündel et al. (2019) go a step further and use a DenseNet variant to propose the location-aware DNetLoc, which counters class imbalance with additional weights within the loss function. These weights are tuned based on the label frequency per batch.
The ResNet is also a frequent option for image classification, as exemplified in Li et al. (2018). Here, the authors attempt to classify and locate the pathologies with limited supervision and a single model. More specifically, by slicing the image into a patch grid, the model is able to capture local information on each disease, while at the same time considering information present in the whole image. Alternatively, one can combine the ResNet and DenseNet to build the DualCheXNet by Chen et al. (2019). Its novel dual asymmetric architecture, i.e. with two asymmetric networks depicted in Figure 5, adaptively captures more discriminative features of several pathologies. In other words, since the DenseNet and ResNet capture different and unique features, the network is able to learn complementary details, thus increasing its performance. The two asymmetric feature streams are later combined with a fusion classifier, and evaluated based on a unified loss function, which is a variation of the weighted cross entropy loss with a modulating factor to deal with class imbalance. [...] (2018) tackle the same goals and context by enhancing the DenseNet with squeeze-and-excitation blocks and multi-map transfer. These contribute to boosting the model's sensitivity to subtle differences between normal and abnormal regions, and the learning process of disease-specific features, respectively.
Zhang, Chen, and Chen (2020) suggest a weakly supervised distance learning framework which, by learning discriminative features among triplets of images, is able to discriminate subtle disease characteristics. As shown in Figure 6, the network considers a pair of images that share the class annotation, and another image which does not. By comparing the unannotated observation (anchor) with images whose pathology is known (positive/negative), the network is able to differentiate the classes by forcing a distance metric to be lower when image pairs share a similar disease, and higher when there is nothing in common.
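The triplet comparison just described can be illustrated with the standard margin-based triplet loss, sketched below in plain Python. This is a generic formulation under stated assumptions and not necessarily the exact loss used by Zhang, Chen, and Chen (2020): the feature vectors are assumed to be embeddings already produced by some network, the Euclidean distance and the margin of 1.0 are illustrative choices, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss.

    The loss is zero once the anchor is closer to the positive (same
    pathology) than to the negative (different pathology) by at least
    `margin`; otherwise it grows with the size of the violation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 0.0]
same_disease = [0.1, 0.0]   # close to the anchor
other_disease = [3.0, 0.0]  # far from the anchor

well_separated = triplet_loss(anchor, same_disease, other_disease)
violated = triplet_loss(anchor, other_disease, same_disease)
print(well_separated, violated)
```

Minimizing this quantity over many triplets pulls images with a shared pathology together in feature space and pushes the others apart, which is the behaviour the distance learning framework relies on.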
In addition, the approach also trains a different classifier on region features to verify whether the attentive regions contain information indicative of any disease. This leads to another trend called attention learning, where the approaches selectively focus on relevant image regions to assess the presence of the pathologies. Guan et al. (2018) support these methods and argue that irrelevant noisy areas are present when training on the global image. In Figure 7, the authors provide an example of an attention-guided DenseNet (AG-CNN) with three branches, which learns from both the disease-specific regions, solving the noise issue, and the global image information, avoiding the loss of discriminative clues. Another example is given in the A3Net's triple attention learning strategy (Wang et al. 2021). Here, a model with a DenseNet backbone encompasses three learning modules with channel-wise, element-wise, and scale-wise attention. These provide information on the most discriminative feature channels, regions of interest, and scales, respectively. Moving on to Guan and Huang (2020), category-wise residual attention learning (CRAL) approaches the classification exercise with a class-specific attentive view. This means that the relevance of the features is expressed by weights based on each category and region, and that these scores are then embedded into a DenseNet's attention blocks to output a final classification. To conclude, Liu et al. (2019) present a contrast-induced attention network (CIA-Net) for disease classification and localization based on the contrastive learning of positive and negative observations. In detail, the framework starts by adjusting all images in terms of scale and angle, to take advantage of a highly structured input and so compute a distance between corresponding pixel coordinates in the positive and negative samples. The distances act as an indication of the lesion areas, thus assisting the contrast-induced attention branch of the CIA-Net in the final prediction. Note that this particular branch generates attention for every label when analysing a pair of negative and positive images.
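Several of the approaches in this section counter class imbalance by weighting the loss, for instance with weights derived from the label frequency in each batch, as mentioned for Gündel et al. (2019). A minimal sketch of such a weighting for a single pathology is given below. The specific weights (batch size divided by the number of positive or negative samples) are an assumption chosen for illustration and may differ from those used in the cited work; the function names are hypothetical.

```python
import math


def batch_weights(labels):
    """Per-batch weights for one pathology from its label frequency.

    labels: list of 0/1 ground-truth values for one class across the batch.
    Rare positives receive a large weight; a class that is absent from the
    batch receives weight 0 to avoid a division by zero.
    """
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    w_pos = n / pos if pos else 0.0
    w_neg = n / neg if neg else 0.0
    return w_pos, w_neg


def weighted_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy with the batch-frequency weights above."""
    w_pos, w_neg = batch_weights(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Toy batch: one positive among four images (hypothetical values).
labels = [1, 0, 0, 0]
loss_hit = weighted_bce([0.9, 0.1, 0.1, 0.1], labels)   # positive detected
loss_miss = weighted_bce([0.1, 0.1, 0.1, 0.1], labels)  # positive missed
print(batch_weights(labels), loss_hit, loss_miss)
```

In a multi-label setting the same computation would be repeated for each of the pathologies and the resulting losses summed or averaged, so that missing a rare finding is penalized more heavily than misclassifying one of the many negative samples.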

Figure 6: A distance learning based model for thoracic pathology classification and location (Zhang, Chen, and Chen 2020)

Considering that all the papers mentioned in this section evaluated their thoracic pathology classification models on ChestX-ray14, Table 4 presents the 14 labels established for this dataset, along with the metrics achieved by each publication, i.e. their AUC scores per class and mean AUC scores. The mean values are presented in the last column, which highlights the three highest scores. All highlighted publications focus on capturing more discriminative characteristics of each pathology present in the images. Guan et al. (2018) identify those subtle features by implementing an attention-guided CNN with three branches, while Chen et al. (2019) do so by combining the ResNet and the DenseNet. Finally, Zhang, Chen, and Chen (2020) opted for a weakly supervised distance learning approach to spot the same indicative attributes. It is important to stress once again that the same data split is not guaranteed, and so it is not possible to establish a direct comparison of the publications. However, one can perceive an overall consistency between the performance values.
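For reference, the per-class and mean AUC scores used in this comparison can be computed as sketched below. The sketch uses the rank interpretation of the AUC, i.e. the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names and the toy scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """Area under the ROC curve for one class via pairwise comparisons."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_class_and_mean_auc(score_rows, label_rows):
    """score_rows / label_rows: one row per image, one column per pathology."""
    n_classes = len(label_rows[0])
    per_class = [
        auc([row[c] for row in score_rows], [row[c] for row in label_rows])
        for c in range(n_classes)
    ]
    return per_class, sum(per_class) / n_classes


# Toy predictions for four images and two pathologies (hypothetical values).
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [1, 1], [0, 0], [0, 1]]

per_class, mean_auc = per_class_and_mean_auc(scores, labels)
print(per_class, mean_auc)  # [1.0, 0.75] 0.875
```

Because the mean is an unweighted average over classes, a rare pathology contributes as much to the mean AUC as a frequent one, which is worth keeping in mind when comparing mean values across publications.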

Conclusions
Computer-aided diagnosis seeks to provide a second opinion to healthcare professionals, reducing their workload and promoting a more accurate early diagnosis. These systems are particularly important to analyse CXR images containing complex information on a variety of pathologies that affect vital organs. Recent advances in DL strategies and computational resources have led to a steep performance increase in CXR-based computer-aided diagnosis algorithms, which was also driven by the availability of larger annotated public CXR datasets. The present publication provides a description of the most relevant public annotated CXR datasets, as well as a comprehensive state-of-the-art review on two particular tasks: abnormality detection and thoracic pathology classification. One may notice that all selected papers were published in or after 2017, which is expected because they follow the recent release of the most popular datasets and the prominent DL trend. It is also noticeable that the results published for each task show no significant disparity, i.e. the reported performance is similar. In terms of abnormality detection, the leading publications concentrate mainly on standard off-the-shelf CNNs, which can be combined with one-class learning or fusion rule-based classification. In thoracic pathology classification, besides the same common CNNs, special attention is given to weakly supervised approaches and attention learning. To conclude, this publication provides an overview of the current knowledge on abnormality detection and thoracic pathology identification by describing and comparing a selected set of papers, considered by the authors as the most relevant in the field, in order to promote future research in this area.