Epistemic and Heteroscedastic Uncertainty Estimation in Retinal Blood Vessel Segmentation

Current state-of-the-art medical image segmentation methods require high-quality datasets to obtain good performance. However, medical specialists often disagree on diagnosis and, hence, datasets contain contradictory annotations. This, in turn, makes the optimization of Deep Learning models more difficult and hinders performance. We propose a method to estimate uncertainty in Convolutional Neural Network (CNN) segmentation models that makes the training of CNNs more robust to contradictory annotations. In this work, we model two types of uncertainty, heteroscedastic and epistemic, without adding any supervisory signal other than the ground-truth segmentation mask. As expected, the uncertainty is higher close to vessel boundaries and on top of thinner and less visible vessels, where medical specialists are more likely to disagree. Therefore, our method is more suitable for learning from datasets created with heterogeneous annotators. We show that there is a correlation between the uncertainty estimated by our method and the disagreement in the segmentation provided by two different medical specialists. Furthermore, by explicitly modeling the uncertainty, the Intersection over Union of the segmentation network improves by 5.7 percentage points.

help preserve fine details from the input image in the output segmentation mask, such as the edges of the object of interest.

Figure 1: We propose a system that segments blood vessels in eye fundus images. Additionally, the system models two types of uncertainty in the prediction: heteroscedastic and epistemic uncertainty.

However, medical doctors often disagree on diagnosis (Krause et al. 2018; Wanderley et al. 2019), leading to inconsistent annotations that may hinder performance. For instance, in the case of Diabetic Retinopathy grading, specialists agree on only 71% of the images in one dataset with 406 eye fundus images. This problem may be even bigger for segmentation tasks, where each pixel of the image is annotated. For a blood vessel segmentation task, two different annotators agree on only 60% of the pixels that were labeled as containing blood vessels by any annotator (Lampert, Stumpf, and Gançarski 2016). This number can drop drastically as the number of annotators increases. In the task of segmenting fissures in high resolution images acquired by an unmanned aerial vehicle, 13 different annotators only agreed on 0.6979% of the pixels marked as fissures by any annotator (Lampert, Stumpf, and Gançarski 2016).
To solve this issue, it is common to have images annotated by multiple doctors and then have a committee reach a consensus for each image, but this reduces the total size of the dataset and, hence, its variability. One possible solution to this problem is to estimate the uncertainty in the model's predictions (Kendall and Gal 2017). Recently, uncertainty estimation in medical imaging has attracted much interest (Awate, Garg, and Jena 2019; Galdran et al. 2019; Garifullin, Lensu, and Uusitalo 2020; Tanno et al. 2017; Wang et al. 2019). These methods can be divided into two main approaches: domain knowledge and Bayesian approaches. Some methods pre-process the segmentation masks to include an "uncertain" class using domain knowledge. For instance, for the task of segmenting arteries and veins in retinal images, crossings in the vasculature and thin blood vessels can be labeled as uncertain (Galdran et al. 2019). However, most existing methods aim to estimate the uncertainty directly from data, without any additional domain knowledge. Some works model epistemic uncertainty through approximate Bayesian inference by means of variational dropout. Other works model the heteroscedastic uncertainty by adding noise to the predictions with an estimated diagonal covariance representing the intrinsic uncertainty (Tanno et al. 2017).
In this work, we model both epistemic and heteroscedastic uncertainty, as shown in Figure 1, without adding any supervisory signal other than the ground-truth segmentation masks. The ground-truth is produced by a heterogeneous group of annotators and we show that our combined uncertainty correlates with the disagreement between annotators. Moreover, simply by explicitly modeling the uncertainty during training, we are able to improve the segmentation Intersection over Union (IoU) results by 5.7 percentage points. In summary, our contributions are as follows:
- Accuracy: Segmentation results are improved by estimating uncertainty.
- Second Opinion: We show that the combined uncertainty is correlated with the disagreement between doctors.
- Explainable: The method estimates higher uncertainty near blood vessel edges and on top of thinner vessels.

Method
In this section, we will show how we segment blood vessels in retinal fundus images and how we estimate uncertainty. We will describe how to estimate two types of uncertainty, the epistemic and heteroscedastic uncertainties and, in the end, how to combine them, as shown in Figure 2.

Blood vessel segmentation
A U-Net (Ronneberger, Fischer, and Brox 2015) was used, which consists of an encoder-decoder architecture with skip connections between the encoder and the decoder. The encoder contains 8 convolutional layers, each followed by the ReLU activation function and BatchNorm (Ioffe and Szegedy 2015). Max-pooling is applied after every two of these Conv-ReLU-BatchNorm blocks. For the decoder, we use 3 Conv-ReLU-BatchNorm blocks, with bilinear upsampling before each of them, followed by a convolutional layer with a single output unit corresponding to the predicted segmentation mask. The model was trained using the Adam optimizer (Kingma and Ba 2014) with default parameters. The per-pixel binary cross-entropy loss is used to train the segmentation model f:

$$\mathcal{L}_{BCE}(f(x)_i, y_i) = -y_i \log f(x)_i - (1 - y_i) \log\left(1 - f(x)_i\right)$$

where x is the input image, y is the ground-truth segmentation mask and i is the pixel index. We then minimize the mean of the per-pixel loss, $\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{BCE}(f(x)_i, y_i)$, where N is the number of pixels in image x.
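The loss described above can be sketched as follows. This is a minimal illustration, assuming that the network output has already been passed through a sigmoid so that each pixel holds a probability. NumPy is used only for readability; the function names and the `eps` clipping constant are assumptions made for this sketch and are not part of the original method.

```python
import numpy as np

def bce_per_pixel(prob, target, eps=1e-7):
    """Per-pixel binary cross-entropy for arrays of the same shape.

    prob   : predicted vessel probabilities in [0, 1]
    target : ground-truth mask with values 0 or 1
    eps    : clipping constant (an assumption) that avoids log(0)
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

def segmentation_loss(prob, target):
    """Mean of the per-pixel loss over the N pixels of the image."""
    return float(bce_per_pixel(prob, target).mean())

# Toy 2x2 "image": confident, mostly correct predictions give a small loss.
prob = np.array([[0.9, 0.2], [0.6, 0.1]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = segmentation_loss(prob, target)
```

In a PyTorch training loop the same quantity would normally be computed by a built-in loss on the raw network outputs, which is numerically more stable than clipping probabilities by hand.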

Epistemic uncertainty
Epistemic uncertainty, also referred to as model uncertainty, accounts for the uncertainty in the model parameters. This type of uncertainty is related to the limited amount of information provided to the model and can be explained away given enough data. We use dropout variational inference (Gal and Ghahramani 2015) to approximate epistemic uncertainty. Dropout with p=10% is added after each ReLU and, at test time, dropout is kept active in T stochastic forward passes. The epistemic uncertainty can be defined as the predictive variance:

$$u = \frac{1}{T}\sum_{t=1}^{T}\left(f_{\hat{W}_t}(x) - \bar{f}(x)\right)^2$$

where $\bar{f}(x) = \frac{1}{T}\sum_{t=1}^{T} f_{\hat{W}_t}(x)$ is the predictive mean and $\hat{W}_t$ denotes the model parameters sampled by dropout in the t-th forward pass. This uncertainty is reduced when all parameter samples $\hat{W}_t$ result in the same prediction.
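The test-time procedure can be illustrated with a short sketch: run T forward passes with dropout still active, then take the per-pixel mean and variance of the predictions. The `toy_forward` function below is a hypothetical stand-in for the U-Net with dropout, and the values of `T` and `seed` are assumptions chosen for the example, since the number of passes is not stated above.

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_forward, x, T=20, seed=0):
    """Predictive mean and variance over T stochastic forward passes.

    stochastic_forward(x, rng) must return per-pixel probabilities and
    draw a fresh dropout mask from `rng` on every call.
    """
    rng = np.random.default_rng(seed)
    samples = np.stack([stochastic_forward(x, rng) for _ in range(T)])
    mean = samples.mean(axis=0)               # predictive mean
    u = ((samples - mean) ** 2).mean(axis=0)  # predictive variance
    return mean, u

def toy_forward(x, rng, p=0.1):
    """Hypothetical model: dropout on the input followed by a sigmoid."""
    keep = rng.random(x.shape) >= p      # drop each value with probability p
    h = x * keep / (1.0 - p)             # inverted-dropout rescaling
    return 1.0 / (1.0 + np.exp(-h))      # per-pixel probability

x = np.linspace(-2.0, 2.0, 16).reshape(4, 4)
mean, u = mc_dropout_uncertainty(toy_forward, x, T=50)
```

If every pass returns the same prediction, the variance is exactly zero, which matches the statement that this uncertainty vanishes when all parameter samples agree.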

Heteroscedastic uncertainty
Heteroscedastic uncertainty captures the noise in the observations and cannot be reduced by gathering more data. For instance, for the task of vessel segmentation, the heteroscedastic uncertainty should be high near badly defined blood vessel edges. To capture this type of uncertainty, we make our model predict the log variance $s_i = \log \sigma_i^2$ for each pixel and modify the loss function:

$$\mathcal{L}_i = e^{-s_i}\,\mathcal{L}_{BCE}(f(x)_i, y_i) + s_i$$

By multiplying the binary cross-entropy loss by $e^{-s_i}$, the model is able to identify erroneous or ambiguous labels and ignore them. In order to avoid the degenerate solution of minimizing the loss by simply estimating high uncertainty in all pixels, we add the term $s_i$. Therefore, the model is optimized to have low uncertainty in all predictions while, at the same time, ignoring labels where it is likely to have high loss. We used the ELU activation function in the estimation of $s_i$ to prevent the model from predicting very large negative values.

Finally, we combine the epistemic uncertainty and the heteroscedastic uncertainty. Before combining these two uncertainties we need to make sure they are in the same range; otherwise, the uncertainty with the larger range could have a bigger weight in the combined uncertainty. We normalize the epistemic uncertainty u and the heteroscedastic uncertainty to have a minimum value of 0 and a maximum value of 1 in the training set. Each uncertainty map v is rescaled as

$$\tilde{v} = \frac{v - v_{\min}}{v_{\max} - v_{\min}}$$

where $v_{\min}$ and $v_{\max}$ are the minimum and maximum values of that uncertainty over the training set.
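A small sketch may help to make the two steps above concrete: a cross-entropy loss that is attenuated by a predicted log variance, and the rescaling of an uncertainty map to the [0, 1] range. It assumes the loss has the form exp(-s) * BCE + s, with s obtained by applying ELU to a raw network output, in the spirit of Kendall and Gal (2017). The function names, the `alpha` and `eps` parameters and the toy values are assumptions made for illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU activation; its output is bounded below by -alpha."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def attenuated_bce(prob, target, raw_s, eps=1e-7):
    """Mean over pixels of exp(-s) * BCE + s, with s = ELU(raw_s).

    A large s down-weights the cross-entropy of a pixel, while the
    additive s term penalizes predicting high uncertainty everywhere.
    """
    s = elu(raw_s)
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    return float((np.exp(-s) * bce + s).mean())

def min_max_normalize(v, v_min, v_max):
    """Rescale an uncertainty map using training-set minimum and maximum."""
    return (v - v_min) / (v_max - v_min)

# A pixel whose label contradicts a confident prediction: raising the
# predicted log variance lowers the loss for that pixel.
prob, target = np.array([0.01]), np.array([1.0])
loss_certain = attenuated_bce(prob, target, raw_s=np.array([0.0]))
loss_uncertain = attenuated_bce(prob, target, raw_s=np.array([1.0]))
```

Because ELU never goes below `-alpha`, the weight exp(-s) cannot exceed exp(alpha), which is one way to read the remark about preventing very large negative values.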

Dataset
We evaluate our method on the publicly available DRIVE dataset (Staal et al. 2004). This dataset contains 40 images from different patients, 7 of which contain signs of mild diabetic retinopathy. The dataset is equally divided into train and test sets, with 20 images in each set. Furthermore, the test set was annotated by two different observers, while the train set contains a single ground-truth annotation. This dataset is still one of the most widely used for the blood vessel segmentation task (Imran et al. 2019) and, as it contains two different annotations for each test set image, it allows us to compare the uncertainty estimation with the disagreement between observers. The images are resized to 512×512 px, and random translation, scale, rotation and flip operations are performed as data augmentation.

Segmentation results
In order to evaluate the effects of modeling the epistemic and heteroscedastic uncertainties on our U-Net model, we start by training a baseline. The baseline model consists of the U-Net described in the Blood vessel segmentation section, trained with binary cross-entropy. All our models were developed using PyTorch (Paszke et al. 2019). We compare our results with the baseline in Table 1 using three different metrics: Area Under the ROC Curve (AUC), Dice Coefficient and Intersection over Union (IoU). The ROC curve plots the sensitivity of the model against 1 − specificity (the false positive rate) at all classification thresholds and is a standard tool for evaluating classification models. The Dice Coefficient and IoU are two standard metrics for segmentation models. The Dice Coefficient doubles the intersection of the predicted and ground-truth segmentation masks and divides it by the sum of the areas of the predicted and ground-truth masks. The IoU, as the name implies, divides the intersection of the predicted and ground-truth masks by the union of the two. By modeling both the epistemic and heteroscedastic uncertainties, we are able to improve the performance of the segmentation model in all 3 metrics. The improvement is larger in the Dice Coefficient and IoU, as they are more robust to class imbalance. Modeling both epistemic and heteroscedastic uncertainties is better than modeling each of them individually. However, both the epistemic and heteroscedastic versions perform better than the baseline in most metrics.
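The two segmentation metrics can be written down directly from their definitions. The sketch below assumes binary masks (the predicted probabilities are thresholded beforehand) and, as a convention chosen only for this example, returns 1.0 when both masks are empty.

```python
import numpy as np

def dice_and_iou(pred_mask, true_mask):
    """Dice Coefficient and Intersection over Union for binary masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    # Returning 1.0 for two empty masks is an assumption of this sketch.
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return float(dice), float(iou)

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
dice, iou = dice_and_iou(pred, true)  # intersection = 2, union = 4
```

For the same pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so they always rank a single prediction consistently, although their averages over a test set can differ.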

Uncertainty evaluation
In order to quantitatively evaluate our combined uncertainty, we compared our estimated uncertainty with the annotators' disagreement. In this work we define the disagreement between annotators as the absolute difference between the two annotations, $d = |y' - y''|$, where $y'$ and $y''$ are the segmentation masks provided by the two observers.
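For binary masks, this definition marks exactly the pixels on which the two observers differ. A minimal sketch, with a hypothetical function name and toy masks chosen for illustration:

```python
import numpy as np

def disagreement_map(ann_a, ann_b):
    """Absolute difference between two binary annotation masks.

    The masks are cast to a signed integer type first, because
    subtracting unsigned 8-bit images directly would wrap around.
    """
    return np.abs(ann_a.astype(np.int64) - ann_b.astype(np.int64))

ann_a = np.array([[1, 1, 0], [0, 1, 0]], dtype=np.uint8)
ann_b = np.array([[1, 0, 0], [0, 1, 1]], dtype=np.uint8)
d = disagreement_map(ann_a, ann_b)
```

The resulting map is 1 where only one observer marked a vessel and 0 elsewhere, so it can be compared pixel by pixel with an uncertainty map of the same size.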
Figure 3 shows that there is some correlation between the estimated uncertainty and the annotators' disagreement. The model tends to estimate high uncertainty close to the boundaries of the blood vessels and on top of thin vessels, which are also the places where the annotators tend to disagree. Furthermore, in some situations there is high uncertainty in places where the model did not predict blood vessels. These results indicate that it could be possible to extract clinically relevant retinal biomarkers with an associated uncertainty that correlates with the disagreement between specialists.

Figure 3: Comparing the disagreement between annotators and the estimated uncertainty. There is more disagreement between annotators close to the boundaries of the blood vessels and on thinner vessels. The estimated uncertainty displays the same behavior. We highlight interesting regions where the uncertainty is similar to the disagreement.

We evaluated quantitatively the similarity between the annotators' disagreement and the estimated uncertainty. For that, we treat the disagreement as ground-truth and compare each uncertainty map with the disagreement. The results are compiled in Table 2 and show that there is some correlation between the estimated uncertainties and the disagreement.

Conclusions
We proposed a method to estimate uncertainty in eye fundus blood vessel segmentation. We modeled both heteroscedastic and epistemic uncertainty and then combined them into a single uncertainty estimation map. The resulting uncertainty correlates with the disagreement in annotations from specialists, which indicates that our method may act as a second opinion. Moreover, this method learns from heterogeneous annotators, as it predicts which pixels are most likely to be annotated differently by medical doctors and includes that information in the loss function. Therefore, it may be possible to eliminate the need for multiple annotators to label the same images and deliberate to reach a consensus, allowing the creation of larger and more variable datasets without hindering performance.
In the future, we want to apply these ideas to multi-class segmentation problems, such as the artery-vein segmentation problem in eye fundus images. Additionally, we want to test the robustness of this method to different levels of noise.