Uncertainty Estimation by Fisher Information-based Evidential Deep Learning

Danruo Deng 1 2, Guangyong Chen 3, Yang Yu 1 2, Furui Liu 3, Pheng-Ann Heng 1 2

1 Department of Computer Science and Engineering, The Chinese University of Hong Kong. 2 Institute of Medical Intelligence and XR, The Chinese University of Hong Kong. 3 Zhejiang Lab. Correspondence to: Guangyong Chen. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Uncertainty estimation is a key factor that makes deep learning reliable in practical applications. Recently proposed evidential neural networks explicitly account for different uncertainties by treating the network's outputs as evidence to parameterize a Dirichlet distribution, and they achieve impressive performance in uncertainty estimation. However, for samples with high data uncertainty that are nevertheless annotated with one-hot labels, the evidence-learning process for those mislabeled classes is over-penalized and remains hindered. To address this problem, we propose a novel method, Fisher Information-based Evidential Deep Learning (I-EDL). In particular, we introduce the Fisher Information Matrix (FIM) to measure the informativeness of the evidence carried by each sample, according to which we dynamically reweight the objective loss terms so that the network focuses more on the representation learning of uncertain classes. The generalization ability of our network is further improved by optimizing the PAC-Bayesian bound. As demonstrated empirically, our proposed method consistently outperforms traditional EDL-related algorithms on multiple uncertainty estimation tasks, especially in the more challenging few-shot classification settings.

1. Introduction

Uncertainty estimation is crucial not only for safe decision-making in high-risk domains such as medical image analysis (Seeböck et al., 2019; Nair et al., 2020) and autonomous vehicle control (Feng et al., 2018; Choi et al., 2019), but also in general fields where data is highly heterogeneous or scarcely annotated (Gal et al., 2017; Ablain et al., 2019).

Predictive uncertainty is quite diverse and can be divided into data uncertainty, model uncertainty, and distributional uncertainty (Gal, 2016; Malinin & Gales, 2018). Data uncertainty, or aleatoric uncertainty, is caused by the natural complexity of the data, such as class overlap and label noise. Model uncertainty, or epistemic uncertainty, measures the uncertainty in estimating model parameters given the training data. Model and data uncertainty are sometimes referred to as reducible and irreducible uncertainty, respectively, since model uncertainty can be reduced with more training data, while data uncertainty cannot. Distributional uncertainty arises from a mismatch between the training and test distributions, i.e., the test data is out-of-distribution (OOD) (Quinonero-Candela et al., 2008). Quantifying these different uncertainties is clearly a key factor in making deep learning reliable.

Softmax is the most widely used normalization function that maps the continuous activations of the output layer to a probability distribution. However, Softmax is notorious for inflating the probabilities of predicted classes (Szegedy et al., 2016; Guo et al., 2017; Wilson & Izmailov, 2020). Some methods calibrate network predictions to improve the reliability of uncertainty estimation (Guo et al., 2017; Liang et al., 2018). However, these methods still cannot distinguish between different types of uncertainty, which seriously limits the practical usage of deep learning in challenging domains without enough training samples. Recently, notable progress has been made in estimating uncertainty in DNNs. One class of methods stems from Bayesian neural networks, which quantify uncertainty by learning a posterior over weights (Gal & Ghahramani, 2016; Ritter et al., 2018; Kristiadi et al., 2022). Other methods combine predictions from several independently trained networks to estimate statistics of the class probability distribution (Lakshminarayanan et al., 2017; Zaidi et al., 2021). However, these approaches still cannot distinguish distributional uncertainty from other uncertainties (Malinin & Gales, 2018).
To address this limitation, Dirichlet-based uncertainty models quantify different types of uncertainty by modeling the output as the concentration parameters of a Dirichlet distribution (Malinin & Gales, 2018; 2019; Nandy et al., 2020; Charpentier et al., 2020). Evidential deep learning (EDL) (Sensoy et al., 2018) adopts the Dirichlet distribution and treats the output as evidence to quantify belief mass and uncertainty by jointly considering the Dempster–Shafer Theory of Evidence (DST) (Dempster, 1968; Shafer, 1976) and subjective logic (SL) (Jøsang, 2016).

The evidential network proposed by EDL can be represented as a probabilistic graphical model, where the observed labels y are generated from the Dirichlet distribution with its parameter α calculated by passing the input sample x through the network. EDL, which learns optimal parameters by maximizing the expected likelihood of the observed labels, shows impressive performance in uncertainty quantification and is widely used in various applications, such as graph neural networks (Zhao et al., 2020), open set recognition (OSR) (Bao et al., 2021; 2022), molecular property prediction (Soleimany et al., 2021), meta-learning (Pandey & Yu, 2022), and active learning (Hemmer et al., 2022). However, for samples with high data uncertainty but annotated with one-hot vectors, the learning process of evidence for those mislabeled classes is over-penalized and remains hindered. Data uncertainty is ubiquitous in real-world applications, such as the indistinguishable "4" and "9" or "1" and "7" in the MNIST dataset, images containing multiple objects in ImageNet, and the unavoidable noisy labels in almost all datasets. Figure 1 illustrates that EDL cannot correctly distinguish MNIST and CIFAR10 image samples with different data uncertainties. Although EDL can model different types of uncertainty, its training process underestimates the irreducible data uncertainty, thereby reducing the availability of uncertainty estimation.

Figure 1. Data uncertainty for (a) digit "4" in MNIST and (b) "horse" in CIFAR10. I-EDL can distinguish between hard samples (orange) and easy samples (green), but EDL cannot.

In this paper, we propose a simple and novel method, Fisher Information-based Evidential Deep Learning (I-EDL), to weigh the importance of different classes for each training sample. In particular, we introduce the Fisher Information Matrix (FIM) to measure the amount of information about the evidence carried by the categorical probabilities p for each sample x. From the derivation of the Dirichlet distribution's FIM, we find that the higher the evidence for a certain class label, the less the corresponding information. Thus, we can set up a Gaussian distribution to help generate the observed labels y by setting a larger variance for these less informative classes. As shown in Figure 2, the generative process of the observed labels y is related not only to the predicted categorical probability p, but also to the parameters α of the Dirichlet distribution. Take the example of an informative image containing both a dog and a cat. The evidence for the dog and cat classes should both be high if the neural network learns correctly. When the observed labels are set as one-hot vectors, our proposed generative model allows the observed label to be either dog or cat while retaining evidence for the other label. From the perspective of optimization, compared with classical EDL, our proposed model encourages the network to focus more on the accuracy of classifying uncertain classes during the training process. From a generative model perspective, compared to classical EDL which only considers a single observation y, the FIM we introduce can be seen as a type of ground-truth distribution. These improvements help enhance the classification performance and uncertainty estimation of evidential neural networks. Finally, we further improve the generalization ability of our network by optimizing the PAC-Bayesian bound.

Figure 2. Graphical model representation of I-EDL, contrasted with the classical evidential network of EDL. The generative process of the observed labels y is related not only to the predicted categorical probability p ∼ Dir(α), but also to the parameters α of the Dirichlet distribution.
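To make the generative view illustrated in Figure 2 concrete, the following minimal NumPy sketch maps invented, purely illustrative evidence values to a Dirichlet distribution and to the belief and uncertainty masses recalled in Section 2; an ambiguous dog-and-cat image can keep high evidence on two classes, while a clean image concentrates it on one.

```python
# Toy illustration of the evidence -> Dirichlet -> opinion mapping.
# The evidence values below are invented for illustration, not model outputs.
import numpy as np

def dirichlet_stats(evidence):
    """Map non-negative evidence e to Dirichlet parameters and subjective-logic quantities."""
    alpha = np.asarray(evidence, dtype=float) + 1.0    # alpha_k = e_k + 1
    alpha0 = alpha.sum()                               # Dirichlet precision (total evidence + K)
    p_hat = alpha / alpha0                             # expected class probabilities
    belief = (alpha - 1.0) / alpha0                    # belief mass b_k = e_k / alpha0
    u = len(alpha) / alpha0                            # uncertainty mass u = K / alpha0
    return p_hat, belief, u

# An easy image: almost all evidence placed on a single class.
p_easy, b_easy, u_easy = dirichlet_stats([40.0, 0.5, 0.5])
# A hard image containing both a dog and a cat: evidence split over two classes.
p_hard, b_hard, u_hard = dirichlet_stats([18.0, 20.0, 0.5])

print(p_easy.round(2), round(u_easy, 2))   # sharp prediction, small uncertainty mass
print(p_hard.round(2), round(u_hard, 2))   # two plausible classes, evidence kept for both
```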
To our knowledge, we are the first to explicitly leverage evidence to improve uncertainty reliability by dynamically weighing objective loss terms. Our main contributions can be summarized as follows: (1) We propose a novel method to quantify uncertainty by combining Fisher information and evidential neural networks. (2) We introduce the PAC-Bayesian bound to further improve generalization ability. (3) Our proposed method achieves excellent empirical performance in confidence evaluation and OOD detection, especially in the more challenging few-shot classification setting.

2. Preliminary

Evidential deep learning (EDL) (Sensoy et al., 2018) models the neural network using the theory of subjective logic (SL) (Jøsang, 1997; 2016), a type of probabilistic logic that explicitly takes into account epistemic uncertainty and source trust. The opinion representation used to express beliefs in SL provides greater expressive power than Boolean truths and probabilities, as it can express "I don't know" as an opinion on the truth of possible states. The concept of beliefs is derived from the Dempster–Shafer Theory of Evidence (DST), a generalization of Bayesian theory to subjective probabilities (Dempster, 1968; Shafer, 1976). The main idea behind DST is to abandon the additivity principle of probability theory, which means that the sum of belief masses can be less than 1, with the remainder supplemented by uncertainty mass, i.e., a lack of evidence about the truth of state values. Compared to
standard neural networks, EDL replaces the last softmax layer with an activation layer, e.g., ReLU or Softplus, to obtain a non-negative output, which is the evidence vector used to parameterize the Dirichlet distribution. Standard DNNs for classification with a softmax output function can be viewed as predicting the expected classification distribution of EDL with an exponential output function. This means it is not sensitive to arbitrary scaling of αk . Compared with the standard neural network classifiers that directly output the classification probability distribution of each sample, evidential neural networks obtain the density of classification probability assignments by parameterizing the Dirichlet distribution. Therefore, EDL models second-order probability and uncertainty (Jøsang, 2016), which can use the properties of Dirichlet distribution to distinguish different types of uncertainties, but classical EDL hinders the learning of evidence, especially for samples with high data uncertainty. More specifically, considering a state space consisting of K mutually exclusive singletons (e.g., class labels), SL provides a belief mass bk for each singleton k ∈ [K] and an overall uncertainty mass u. The K + 1 mass values satisfy PK u + k=1 bk = 1, where u ≥ 0, bk ≥ 0, ∀k ∈ [K]. Belief mass depends on the evidence for the corresponding singleton, which measures the amount of support collected from data. In the absence of evidence, the belief for each singleton is 0, and uncertainty is 1. Conversely, an infinite amount of evidence leaves no room for uncertainty, yielding belief masses that sum to 1. SL formalizes the belief assignment of DST as a Dirichlet distribution with concentration parameters αk = ek + 1, where ek denotes the derived evidence for the k-th singleton. That is, the belief bk and the uncertainty u can easily be derived from the parameters of the corresponding Dirichlet distribution by using bk = αk − 1 , α0 u= 3. Method 3.1. Generative Model of Evidential Network Evidential neural networks are typically trained using a combination of the expected mean squared error (MSE) and a Kullback-Leibler (KL) divergence term as a loss function, where the KL term penalizes evidence for classes that do not fit the training data. Note that, MSE performs best compared with the cross-entropy loss and the negative log marginal likelihood as empirically demonstrated in Sensoy et al. (2018). Let fθ : Rd → RK + denotes the evidential neural network. Given a sample (x, y), where y is the onehot encoded ground-truth class of observation x, the loss function of EDL is expressed as:   L(θ) = Ep∼Dir(α) (y − p)T (y − p) K , α0 PK where α0 = k=1 αk is referred to as the precision of the Dirichlet distribution. Higher values of α0 lead to sharper, more confident distributions. The Dirichlet distribution is the conjugate prior of the categorical distribution. It is parameterized by the concentration parameters α = [α1 , · · · , αK ], ∀αk > 0, defined as: Γ (α0 ) Dir(p|α) = QK K Y k=1 Γ (αk ) k=1 k −1 pα , α0 = k K X + λDKL (Dir(p|α̂)∥Dir(p|1)), where α = fθ (x) + 1, α̂ = α ⊙ (1 − y) + y, λ ≥ 0, 1 = [1; · · · ; 1] ∈ RK , and Dir(p|1) denotes the uniform Dirichlet distribution. αk k=1 Actually, when trained with MSE loss, the evidential network proposed by EDL can be understood as a new probabilistic graphical model. Specifically, let x, y denote random variables whose unknown probability distribution generates inputs x ∈ Rd and labels y ∈ RK , respectively. 
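For reference, the classical EDL objective just stated can be written out per sample; the sketch below is our own NumPy/SciPy rendering (not the released implementation), using the standard closed form of the expected squared error under Dir(α) together with the KL regularizer on the label-removed concentration α̂.

```python
# A hedged sketch of the classical EDL per-sample loss described above.
import numpy as np
from scipy.special import gammaln, digamma

def edl_loss(evidence, y, lam):
    """evidence: (K,) non-negative network output; y: (K,) one-hot label; lam: KL weight."""
    alpha = evidence + 1.0
    a0 = alpha.sum()
    p_hat = alpha / a0
    # E_{p ~ Dir(alpha)}[(y - p)^T (y - p)] in closed form: squared bias plus variance.
    mse = np.sum((y - p_hat) ** 2 + alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0)))
    # Remove the evidence of the true class before the KL regularizer.
    alpha_hat = alpha * (1.0 - y) + y
    a0_hat = alpha_hat.sum()
    K = len(alpha)
    kl = (gammaln(a0_hat) - gammaln(K) - gammaln(alpha_hat).sum()
          + np.sum((alpha_hat - 1.0) * (digamma(alpha_hat) - digamma(a0_hat))))
    return mse + lam * kl

y = np.array([0.0, 1.0, 0.0])
print(edl_loss(np.array([1.0, 8.0, 1.0]), y, lam=0.1))
```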
EDL (Sensoy et al., 2018) supposes the observed labels y were drawn i.i.d. from an isotropic Gaussian distribution, i.e. where p ∈ ∆K−1 , and Γ(·) is the gamma function. Given an opinion, the expected probability of the k-th singleton is equivalent to the mean of the corresponding Dirichlet distribution, calculated as p̂k = αk ek + 1 = PK . α0 c=1 ec + K y ∼ N (p, σ 2 I), When the evidence for one of the K attributes is observed, the corresponding Dirichlet parameter is incremented to update the Dirichlet distribution with the new observation. where p ∼ Dir(fθ (x)+1). Then, training evidential neural networks by minimizing the expected MSE can be viewed as learning model parameters that maximize the expected likelihood of the observed labels. Since the observed labels y are one-hot encoded, and the Gaussian distribution of EDL encourages neural networks to formalize the multiple opinions for the classification of a given sample as a Dirichlet distribution (Sensoy et al., 2018). Compared to 3 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning the generated labels is isotropic, i.e. each class is treated independently with the same variance, for high data uncertainty samples, the learning process of evidence for those mislabeled classes is over-penalized and remains hindered. This results in the total amount of evidence being underestimated and make the network overfitting. Furthermore, recently proposed work (Bengs et al., 2022) also argues that classical EDL does not incentivize learners to faithfully predict their epistemic uncertainty due to its sensitivity to the regularization parameter λ. Due to the concavity of the log function, by Jensen’s inequality, our objective of Eq.(2) can be achieved by minimizing the expected negative log-likelihood loss function:   min E(x,y)∼P Ep∼Dir(α) − log p(y|p, α, σ 2 ) . 3.2. Fisher Information-based Evidential Network 3.3. Learning with PAC-Bayesian Bound In this work, based on the fact that the information of each class carried in categorical probabilities p is different, we argue that the generation of each class for a specific sample should not be isotropic. Intuitively, a certain class label with higher evidence is allowed to have a larger variance, so that the evidence for missing labels can be preserved while maximizing the likelihood of the observed labels. The Fisher information matrix (FIM) is chosen to measure the amount of information that the categorical probabilities p carry about the concentration parameters α of a Dirichlet distribution that models p. Formally, the FIM is defined as:   ∂ℓ ∂ℓ , (1) I(α) = EDir(p|α) ∂α ∂αT Since the PAC-Bayesian theory (McAllester, 1999) provides data-driven generalization bounds computed on the training set and are simultaneously valid for all posteriors on network parameters, it is often used as a criterion for model selection or as an inspiration for learning algorithm conception. In the PAC-Bayes setting, it assumes that the predictor fθ has prior knowledge of the hypothesis space Θ in the form of a prior distribution π. After the training dataset D is fed to the predictor, the prior is updated to a posterior distribution ρ. The full bound theorem is restated below, derived from the theorems in Germain et al. (2009), Alquier et al. (2016), Masegosa (2020), and we give the proof in appendix A.3 for completeness. Theorem 3.1 ((Germain et al., 2009; Alquier et al., 2016; Masegosa, 2020)). 
Given a data distribution P over X × Y, a hypothesis set θ, a prior distribution π over Θ, for any δ ∈ (0, 1], and λ > 0, with probability at least 1 − δ over samples D ∼ P n , we have for all posterior ρ, θ s.t. α = fθ (x) + 1 (3)   2 ∂ log Dir(p|α) I(α) = EDir(p|α) − ∂ααT p(y|p, α, σ 2 ) = N (y|p, σ 2 I(α)−1 ) where ℓ = log Dir(p|α) is the log-likelihood function. Under weak conditions (see Lemma 5.3 in Lehmann & Casella (2006)), the FIM   can be expressed as I(α) = EDir(p|α) −∂ 2 ℓ/∂ααT . After applying a series of derivation steps (see Appendix A.1 for details), I(α) can be simplified to: Eρ(θ) [L(θ)] ≤ Eρ(θ) [L̂D (θ)]   1 1 + DKL (ρ∥π) + log + ΨP,π (λ, n) , λ δ h i where ΨP,π (λ, n) = log Eπ(θ) ED∼P n eλ(L(θ)−L̂D (θ)) . I(α) = diag([ψ (1) (α1 ), · · · , ψ (1) (αK )]) − ψ (1) (α0 )11T , where ψ (1) (·) denotes the trigamma function, defined as ψ (1) (x) = dψ(x)/dx = d2 ln Γ(x)/dx2 . Since ψ (1) (x) is a monotonically decreasing function when x > 0, the class label with higher evidence corresponds to less Fisher information. Hence, we use the inverse of the FIM (I(α)−1 ) as the variance of the generative distribution of y. In this paper, we treat Dir(p|α) as the posterior distribution, and the prior as Dir(p|µ), where µ is set to β ≫ 1 for the corresponding class and 1 for all other class. Given training set D = {(xi , yi )}N i=1 , π, λ and δ, by the Theorem 3.1, the upper bound of Eq.(3) can be expressed as Thus, we assume that the target variable y follows a multivariate Gaussian distribution with the following closed form: y ∼ N (p, σ 2 I(α)−1 ), N 1 X 1 Li (θ) + DKL (Dir(pi |αi )∥Dir(pi |µi )), (4) N i=1 λ   where Li (θ) = EDir(pi |αi ) − log N (yi |pi , σ 2 I(αi )−1 ) , αi = fθ (xi ) + 1 and pi ∼ Dir(αi ). where α = fθ (x) + 1, p ∼ Dir(α), σ 2 is the scalar used to adjust covariance value, I(α) is the FIM of Dir(α), defined as Eq.(1). In MLE, we aim to learn model parameters θ that maximize the marginal likelihood obtained by integrating the class probabilities, i.e.   max E(x,y)∼P log Ep∼Dir(α) [N (y|p, σ 2 I(α)−1 )] . The first term in Eq.(4) can be reformulated as a trade-off between the expected FIM-weighted MSE (I-MSE) and a penalty of negative log determinant of the FIM (|I|), since   Li (θ) ∝ E (yi − pi )T I(αi )(yi − pi ) −σ 2 log |I(αi )| . | {z } | {z } θ (2) LI-MSE i 4 |I| Li Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Algorithm 1 I-Evidential Deep Learning Among them, since the class label with less evidence corresponds to the larger Fisher information, minimizing LI-MSE i can be viewed as: for a certain class with low evidence, we expect its corresponding prediction probability to be more accurate, regardless of whether this class is the ground-truth |I| label or not. The penalty for adding −Li can be seen as avoiding overconfidence caused by excessive evidence. It is worth noting that these loss terms are not only related to classes but also related to samples. Thus, the FIM-weighted term can be considered as an adaptive weight that can selfadjust the corresponding MSE loss based on the information of each class contained in the sample. N Input: λ, Training set D = {(xi , yi )}i=1 , batch size b, learning rate β, total epochs T Initialize θ for t = 0, 1, · · · , T do λt = min(1.0, t/T ) for Db ∼ D do for (xi , yi ) ∼ Db do αi = fθ (xi ) + 1 α̂i = αi ⊙ (1 − yi ) + yi |I| Li = LI-MSE − λLi + λt LKL //see Eq.(5) i i end for Pb θ ← θ − β∇θ L with L = 1b i=1 Li end for end for For the second term in Eq.(4), µi varies with sample labels. 
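Before simplifying this KL term, the FIM-related pieces of the first term can be made concrete. The following sketch (ours, following the formulas above) builds the Dirichlet FIM of Eq.(1) from trigamma values and evaluates its log-determinant with the matrix determinant lemma, checking the result against a direct computation.

```python
# Dirichlet Fisher information matrix and its log-determinant.
import numpy as np
from scipy.special import polygamma

def dirichlet_fim(alpha):
    """I(alpha) = diag(psi'(alpha)) - psi'(alpha0) * 11^T, with trigamma = polygamma(1, .)."""
    alpha = np.asarray(alpha, dtype=float)
    return np.diag(polygamma(1, alpha)) - polygamma(1, alpha.sum())

def log_det_fim(alpha):
    """log|I(alpha)| = sum_i log psi'(alpha_i) + log(1 - sum_i psi'(alpha0) / psi'(alpha_i))."""
    alpha = np.asarray(alpha, dtype=float)
    tri, tri0 = polygamma(1, alpha), polygamma(1, alpha.sum())
    return np.sum(np.log(tri)) + np.log(1.0 - np.sum(tri0 / tri))

alpha = np.array([2.0, 5.0, 1.5, 9.0])
assert np.allclose(log_det_fim(alpha), np.linalg.slogdet(dirichlet_fim(alpha))[1])
print(dirichlet_fim(alpha).diagonal())  # higher evidence -> smaller trigamma -> less information
```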
To simplify this Kullback-Leibler (KL) divergence, we set α̂i = αi ⊙ (1 − yi ) + yi to remove the predicted concentration parameter of the true label corresponding to the sample xi following Sensoy et al. (2018), thereby the KL term is converted to Table 1. Given a sample (xi , yi ), the difference in loss function between I-EDL and EDL are marked in blue. LKL i = DKL (Dir(pi |α̂i )∥Dir(pi |1)). I-EDL EDL Finally, the objective function Eq.(4) can be reformulated as min θ N 1 X N i=1 MSE |I| LI-MSE − λ1 Li + λ2 LKL i i , (5) + K  X j=1 |I| Li = K X αij 2 αij (αi0 − αij ) ) + 2 (yij − αi0 αi0 (αi0 + 1)  log ψ (1) (αij ) + log 1 − j=1 K X α̂ij ) − log Γ(K) − j=1 + K X ψ (1) (αij ) K X (α̂ij − 1) ψ(α̂ij ) − ψ( αij (αi0 −αij ) α2i0 (αi0 +1) + PK j=1 αij (αi0 −αij ) (1) ψ (αij ) α2i0 (αi0 +1) DKL (Dir(α̂i )∥Dir(1)) DKL (Dir(α̂i )∥Dir(1)) I - − log |I(αi )| computation of closed-form classical uncertainty metrics such as differential entropy, mutual information, etc. These metrics can be used to distinguish among data, model, and distributional uncertainty. Different DBU models differ in the parameterization and training strategy of the Dirichlet distribution. For example, KL-PN (Malinin & Gales, 2018) proposes the Prior Networks (PN) trained with two KL divergence terms. The first term is used to learn sharp Dirichlet parameters for ID data, while the other learns flat Dirichlet parameters for OOD data. Since the forward KL divergence is zero-avoiding, RKL-PN (Malinin & Gales, 2019) introduces the reverse KL divergence to avoid undesired multimodal target distributions. Posterior Network (PostN) (Charpentier et al., 2020) uses Normalizing Flows to predict the posterior distribution of any input sample without training with OOD data. Evidential Deep Learning (EDL) (Sensoy et al., 2018) treats the network’s outputs as belief masses based on the Dempster-Shafer Theory of Evidence (DST) (Sentz & Ferson, 2002) and derives the loss function using subjective logic (Jøsang, 2016). Moreover, deep evidential regression (Amini et al., 2020; Soleimany et al., 2021) introduces evidential priors over the original Gaussian likelihood function to model the uncertainty of  , log Γ(α̂ij ) K X j=1 (αij ), j=1 " j=1 ψ (1) K X ψ (1) (αi0 ) j=1 LKL i = log Γ(  PK αij 2 (1) (αij ) j=1 (yij − αi0 ) ψ PK KL where LI-MSE = i αij 2 j=1 (yij − αi0 ) PK # α̂ik ) , k=1 and λ1 , λ2 ≥ 0. The detailed derivation steps are left in Appendix A.2. The complete pseudo-code of our method is outlined in Algorithm 1. Actually, classical EDL can be |I| viewed as a degenerate version of I-EDL without Li and MSE I-MSE with Li instead of Li . Furthermore, we introduce the KL term from the PAC-Bayesian bound to make it more reasonable and interpretable. A detailed comparison of IEDL and classical EDL is given in Table 1. 4. Related Work Dirichlet-based uncertainty models (DBU) predict the parameters of the Dirichlet distribution, which allows the 5 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning regression networks. Meinert et al. (2022) further analyze why DER can produce reasonable results in practice despite overparameterized representations of uncertainty. Zhao et al. (2020) propose a multi-source uncertainty framework combined with DST for semi-supervised node classification with GNNs. Bao et al. (2022) propose a general framework for Open Set Temporal Action Localization (OSTAL) based on EDL. 
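Returning briefly to the training objective of Section 3, the per-sample I-EDL loss of Eq.(5), as summarized in Algorithm 1 and Table 1, can be sketched as follows; this is a hedged re-implementation from the formulas above rather than the authors' released code.

```python
# A hedged end-to-end sketch of the per-sample I-EDL objective in Eq.(5) / Algorithm 1.
import numpy as np
from scipy.special import polygamma, gammaln, digamma

def i_edl_loss(evidence, y, lam1, lam2):
    alpha = evidence + 1.0
    a0, K = alpha.sum(), len(alpha)
    tri, tri0 = polygamma(1, alpha), polygamma(1, a0)
    # FIM-weighted expected MSE: each class term is reweighted by its trigamma value.
    sq = (y - alpha / a0) ** 2 + alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))
    l_imse = np.sum(sq * tri)
    # Negative-log-determinant penalty of the FIM (discourages excessive evidence).
    l_det = np.sum(np.log(tri)) + np.log(1.0 - np.sum(tri0 / tri))
    # KL regularizer on the label-removed concentration alpha_hat, as in classical EDL.
    a_hat = alpha * (1.0 - y) + y
    a0_hat = a_hat.sum()
    l_kl = (gammaln(a0_hat) - gammaln(K) - gammaln(a_hat).sum()
            + np.sum((a_hat - 1.0) * (digamma(a_hat) - digamma(a0_hat))))
    return l_imse - lam1 * l_det + lam2 * l_kl

# Example with an annealed KL weight, mirroring lambda_t = min(1, t/T) in Algorithm 1.
y = np.array([0.0, 1.0, 0.0])
for t, T in [(0, 10), (5, 10), (10, 10)]:
    print(i_edl_loss(np.array([1.0, 8.0, 1.0]), y, lam1=0.1, lam2=min(1.0, t / T)))
```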
Although EDL shows an impressive performance in uncertainty quantification and is widely used in various applications, recently proposed work (Bengs et al., 2022) argues that classical EDL does not motivate learners to faithfully predict their epistemic uncertainty because it is sensitive to the regularization parameter. Compared with previous efforts, our method is the first to exploit evidence for training to improve the performance of uncertainty quantification. The Fisher information matrix (FIM) we introduce can be seen as some type of first-order distribution information, which can help learners make more accurate predictions and better estimate uncertainty. calibration results with deep ensembles. Additionally, there are also some methods of modeling with label noise. For example, Collier et al. (2021) propose input-dependent noise losses for label noise in classification. Cui et al. (2022) provides a unified framework for reliable learning under the joint (image, label)-noise. Compared to these works, we focus on the evidence underestimation problem in evidential networks, and more importantly, we are the first to address this problem by introducing the Fisher information matrix. 5. Experiments In this section, we conduct extensive experiments to compare the performance of our proposed method with previous methods on multiple uncertainty estimation-related tasks. See Appendix C for additional results and more details. 1 5.1. Experimental Setup Datasets We evaluate our algorithm on the following image classification datasets: MNIST (LeCun, 1998), CIFAR10 (Krizhevsky et al., 2009), and mini-ImageNet (Vinyals et al., 2016). For OOD detection experiments, we use KMNIST (Clanuwat et al., 2018) and FashionMNIST (Xiao et al., 2017) for MNIST, the Street View House Numbers (SVHN) (Netzer et al., 2018) and CIFAR100 (Krizhevsky et al., 2009) for CIFAR10, and the CaltechUCSD Birds (CUB) dataset (Wah et al., 2011) for miniImageNet. More details are given in Appendix C.1. Bayesian Neural Networks (BNNs) explicitly model network parameters as random variables, quantifying uncertainty by learning a posterior over parameters. Since the posterior inference of BNNs is intractable, many posterior approximation schemes have been developed to improve scalabilities, such as variational inference (VI) (Graves, 2011; Blundell et al., 2015), stochastic gradient Markov Chain Monte Carlo (Welling & Teh, 2011; Ma et al., 2015), and Laplace approximation (Ritter et al., 2018; Kristiadi et al., 2021). Furthermore, the integral of marginalizing the likelihood with the posterior distribution is also intractable and is typically approximated via sampling. A well-known method is Monte Carlo Dropout (MC Dropout) (Gal & Ghahramani, 2016), which treats the dropout layer as a Bernoulli distributed random variable, and training a network with dropout layers can be interpreted as an approximate VI. However, these methods require significant modifications to the training process and are computationally expensive, and more importantly, cannot distinguish between distributional uncertainty and other uncertainties. Implementation details Following Charpentier et al. (2020), we use 3 convolutional layers and 3 dense layers (ConvNet) on MNIST and VGG16 (Simonyan & Zisserman, 2014) on CIFAR10. For all experiments on both datasets, we split the data into train, validation, and test sets. We use a validation loss-based early termination strategy to train up to 200 epochs with a batch size of 64. 
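As an illustration of the kind of backbone described here, the sketch below defines an evidential ConvNet with three convolutional and three dense layers and a Softplus evidence head; the channel widths and kernel sizes are our assumptions rather than the exact configuration used in the experiments.

```python
# A hedged PyTorch sketch of a small evidential ConvNet (3 conv + 3 dense layers).
# Layer widths and kernel sizes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # Softplus keeps the output non-negative, so it can serve as evidence (alpha = evidence + 1).
        return F.softplus(self.classifier(self.features(x)))

model = EvidentialConvNet()
evidence = model(torch.zeros(2, 1, 28, 28))   # MNIST-sized dummy batch
print(evidence.shape)                          # torch.Size([2, 10])
```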
For the mini-ImageNet dataset, we conduct experiments on the more challenging few-shot classification setting. We use WideResNet-28-10 pre-trained backbone from Yang et al. (2021) as the feature extractor to train a 1-layer classifier. Refer to Appendix C.2 for more details. Calibration methods aim to reduce over-confidence by calibrating models. For example, Guo et al. (2017) introduce temperature scaling as a post-hoc calibration to mitigate overconfidence. ODIN (Liang et al., 2018) uses a mix of temperature scaling at the softmax layer and input perturbations. Pereyra et al. (2017) penalize the low-entropy output distribution in the loss function. Karandikar et al. (2021) propose differentiable losses to improve calibration based on a soft version of the binning operation underlying popular calibration-error estimators. Roelofs et al. (2022) focus on assessing statistical bias in calibration. Since these methods cannot distinguish between different types of uncertainty, they are often combined with the first two types of methods, such as Lakshminarayanan et al. (2017) combined Baselines We focus on comparing our algorithm with other Dirichlet-based uncertainty methods, since only DBU methods can distinguish different types of uncertainty compared to BNNs and calibration methods, as mentioned previously. In particular, we compare to following baselines: Prior Networks (PN) trained with KL divergence (KL-PN) (Malinin & Gales, 2018) and Reverse KL divergence (RKLPN) (Malinin & Gales, 2019), Posterior Network (PostN) (Charpentier et al., 2020), and Evidential Deep Learning 1 The code is available at: danruod/IEDL 6 https://github.com/ Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Table 2. AUPR scores of OOD detection (mean ± standard deviation of 5 runs). † indicates that the first four lines are the results reported by Charpentier et al. (2020). Bold and underlined numbers indicate the best and runner-up scores, respectively. MNIST → KMNIST† MNIST → FMNIST† CIFAR10 → SVHN† CIFAR10 → CIFAR100 Method Max.P α0 Max.P α0 Max.P α0 Max.P α0 MC Dropout KL-PN RKL-PN PostN EDL 94.00 ± 0.1 92.97 ± 1.2 60.76 ± 2.9 95.75 ± 0.2 97.02 ± 0.8 93.39 ± 1.0 53.76 ± 3.4 94.59 ± 0.3 96.31 ± 2.0 96.56 ± 0.2 98.44 ± 0.1 78.45 ± 3.1 97.78 ± 0.2 98.10 ± 0.4 98.16 ± 0.0 72.18 ± 3.6 97.24 ± 0.3 98.08 ± 0.4 51.39 ± 0.1 43.96 ± 1.9 53.61 ± 1.1 80.21 ± 0.2 78.87 ± 3.5 43.23 ± 2.3 49.37 ± 0.8 77.71 ± 0.3 79.12 ± 3.7 45.57 ± 1.0 61.41 ± 2.8 55.42 ± 2.6 81.96 ± 0.8 84.30 ± 0.7 61.53 ± 3.4 54.74 ± 2.8 82.06 ± 0.8 84.18 ± 0.7 I-EDL 98.34 ± 0.2 98.33 ± 0.2 98.89 ± 0.3 98.86 ± 0.3 83.26 ± 2.4 82.96 ± 2.2 85.35 ± 0.7 84.84 ± 0.6 5.3. OOD detection Table 3. AUPR scores and accuracy of CIFAR10 with VGG16 in misclassified image detection and image classification, respectively. Each experiment is run with 5 seeds. † denotes results reported by (Charpentier et al., 2020). Bold and underlined numbers indicate the best and runner-up scores, respectively. Method MC Dropout KL-PN† RKL-PN† PostN† EDL I-EDL † Max.P Max.α Acc. 97.15 ± 0.0 50.61 ± 4.0 86.11 ± 0.4 97.76 ± 0.0 97.86 ± 0.2 52.49 ± 4.2 85.59 ± 0.3 97.25 ± 0.0 97.86 ± 0.2 82.84 ± 0.1 27.46 ± 1.7 64.76 ± 0.3 84.85 ± 0.0 83.55 ± 0.6 98.72 ± 0.1 98.63 ± 0.1 89.20 ± 0.3 We then measure the usability of uncertainty quantification in the OOD detection task. The performance of OOD detection is also measured by AUPR with labels 1 for ID data and 0 for OOD data. 
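A minimal sketch of this evaluation protocol (ours, with synthetic Dirichlet parameters in place of real network outputs) for the two Dirichlet-based scores discussed in this subsection, the maximum expected probability (Max.P) and the precision α0:

```python
# Toy OOD-detection evaluation: AUPR with ID labeled 1 and OOD labeled 0,
# scored either by the maximum expected probability (Max.P) or by alpha_0.
import numpy as np
from sklearn.metrics import average_precision_score

def ood_scores(alpha):
    """alpha: (N, K) predicted Dirichlet parameters for a batch of inputs."""
    alpha0 = alpha.sum(axis=1)                      # total evidence + K
    max_p = (alpha / alpha0[:, None]).max(axis=1)   # maximum expected class probability
    return max_p, alpha0

rng = np.random.default_rng(0)
one_hot = np.eye(10)[rng.integers(10, size=100)]
alpha_id = one_hot * rng.uniform(20, 60, size=(100, 1)) + 1.0   # sharp Dirichlets (ID-like)
alpha_ood = rng.uniform(1.0, 2.0, size=(100, 10))               # nearly flat Dirichlets (OOD-like)
alpha = np.vstack([alpha_id, alpha_ood])
is_id = np.concatenate([np.ones(100), np.zeros(100)])

for name, score in zip(["Max.P", "alpha_0"], ood_scores(alpha)):
    print(name, "AUPR:", round(float(average_precision_score(is_id, score)), 3))
```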
The scores of P the DBU methods are given by maxc pc (Max.P) and α0 ( c αc ) respectively, while Dropout uses Max.P as scores. We compare our method with other methods on four OOD detection tasks, including MNIST against KMNIST and FMNIST, and CIFAR10 against SVHN and CIFAR100. Table 2 shows our proposed method achieves superior performance in all tasks without training with additional OOD data. More specifically, I-EDL outperforms the second-placed method by about 1.3%, 0.5%, 3.8% and 1.2% on four OOD detection tasks, respectively. Note that EDL does not achieve suboptimal performance on all OOD detection tasks. We also evaluate performance using differential entropy (D.Ent.) and mutual information (M.I.) as scores and area under a ROC curve (AUROC). All these results can be seen in Appendix C.3. Given lots of efforts contributed to OOD detection (Liang et al., 2018; Sastry & Oore, 2020), here we mainly focus on the comparisons with DBU models, which solve OOD detection by distinguishing different types of uncertainty. (EDL) (Sensoy et al., 2018). Furthermore, we compare the dropout model (MC Dropout) (Gal & Ghahramani, 2016), which is often state-of-the-art in many uncertainty estimation tasks (Ovadia et al., 2019). Since other methods except KL-PN and RKL-PN do not require OOD data for training. For a fair comparison, we use the uniform noise instead of actual OOD test data as OOD training data for the former two methods following Charpentier et al. (2020). 5.2. Confidence Evaluation 5.4. Few-shot Learning We first measure the availability of uncertainty estimates in the confidence evaluation tasks that aim to answer an interesting question ”Are more confident (i.e., less uncertain) predictions more likely to be correct?”. We use the area under the precision-recall curve (AUPR) metric. For DBU methods, we represent maxc pc (Max.P) and maxc αc (Max.α) respectively as the scores with labels 1 for correct and 0 for incorrect predictions. Since the dropout model does not have concentration parameters, we only provide results with Max.P as scores. Table 3 shows our proposed method achieves state-of-the-art performance in all measurements. In particular, our method improves image classification by about 5.2% and confidence estimation by about 0.9% compared to the runner-up methods. We next conduct more challenging few-shot experiments on mini-ImageNet. We use the WideResNet trained following Yang et al. (2021) to obtain pre-trained features, and then train the 1-layer classifier under N -way K-shot setting. We evaluate {5, 10}-way {1, 5, 20}-shot classification, confidence estimation and OOD detection. The performance of classification and uncertainty estimation are reported in the average accuracy(%, top-1) and AUPR(%), respectively, with 95% confidence interval over 10, 000 fewshot episodes. Each episode contains randomly sampled N classes and K samples per class for adaptation, min(15, K) query samples per class for image classification and confidence evaluation, and the same number of query samples 7 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Table 4. Classification accuracy (Acc.), AUPR scores for both confidence evaluation (Conf.) and OOD detection (OOD) under {5, 10}-way {1, 5, 20}-shot settings of mini-ImageNet. CUB is used for OOD detection. Each experiment is run for over 10, 000 few-shot episodes. 5-Way 1-Shot 5-Way 5-Shot 5-Way 20-Shot Method Acc. Conf. (Max.α) OOD (α0 ) Acc. Conf. (Max.α) OOD (α0 ) Acc. Conf. 
(Max.α) OOD (α0 ) EDL I-EDL 61.00 ± 0.22 63.82 ± 0.20 80.59 ± 0.23 82.00 ± 0.21 65.40 ± 0.26 74.76 ± 0.25 80.38 ± 0.15 82.00 ± 0.14 93.92 ± 0.09 94.09 ± 0.09 76.53 ± 0.27 82.48 ± 0.20 85.54 ± 0.12 88.12 ± 0.09 97.51 ± 0.04 97.54 ± 0.04 79.78 ± 0.23 85.40 ± 0.19 ∆ 2.82 1.41 9.36 1.62 0.17 5.95 2.58 0.04 5.62 Method Acc. Conf. (Max.α) OOD (α0 ) Acc. Conf. (Max.α) OOD (α0 ) Acc. Conf. (Max.α) OOD (α0 ) EDL I-EDL 44.55 ± 0.15 49.37 ± 0.13 65.97 ± 0.20 68.29 ± 0.19 67.83 ± 0.24 71.95 ± 0.20 62.52 ± 0.16 67.89 ± 0.11 86.81 ± 0.10 87.45 ± 0.09 76.34 ± 0.20 82.29 ± 0.19 69.29 ± 0.17 78.60 ± 0.08 94.21 ± 0.06 94.40 ± 0.04 76.88 ± 0.17 82.52 ± 0.14 ∆ 4.82 2.32 4.12 5.37 0.64 5.95 9.31 0.19 5.64 10-Way 1-Shot 10-Way 5-Shot 10-Way 20-Shot Table 5. Ablation studies under mini-ImageNet 5-way 5-shot for image classification, confidence evaluation, and OOD detection against CUB. I-MSE |I| ✓ ✓ ✓ Figure 3. (a) Comparison of OOD detection (AUPR) on miniImageNet against CUB under 5-way 5-shot. (b) Noisy data detection on CIFAR10. ✓ Acc. Conf. (Max.α) OOD (α0 ) 80.38 ± 0.15 81.82 ± 0.14 81.27 ± 0.14 82.00 ± 0.14 93.92 ± 0.09 93.97 ± 0.09 94.42 ± 0.08 94.09 ± 0.09 76.53 ± 0.27 79.68 ± 0.25 81.75 ± 0.22 82.48 ± 0.20 of the ID dataset. As shown in Fig. 3(b), our method demonstrates excellent detection of noisy data across all metrics, outperforming the runner-up method by more than 11%. from CUB for OOD detection. 5.6. Ablation Study As shown in Table 4, the average test accuracy, Max.αbased confidence evaluation, and α0 -based OOD detection show impressive improvements on I-EDL over EDL on all the N -way K-shot tasks. More specifically, all average test accuracy improvements of our method exceed 1.62%, up to 9.31% under 10-way 20-shot. In confidence evaluation, I-EDL also shows better performance than EDL, especially the improvement over 2.32% under 10-way 1-shot. Moreover, I-EDL shows excellent performance in OOD detection, where all improvements are between 4.12% and 9.36%. Furthermore, we compare the OOD detection performance of our method and other DBU methods under 5-way 5-shot in Fig. 3(a). All of these results demonstrate that our method not only improves classification accuracy but also greatly improves the availability of uncertainty estimation in the more challenging few-shot scenarios. More results including AUPR and AUROC scores based on M.I. and D.Ent are provided in Appendix C.4. We further investigate our method performance with an ablation study and summarize it in Table 5. We respectively ablate the effects of expected FIM-weighted MSE (I-MSE) and FIM’s negative log determinant (|I|). Note that classical EDL is equivalent to I-EDL without using I-MSE and |I|. From the result, we observe that both optimizations are beneficial for image classification, confidence evaluation, and OOD detection. In particular, with the only usage of |I| (I-MSE), the improvements over EDL for classification and α0 -based OOD detection are ∼ 1.9% (1.3%) and ∼ 4.1% (6.8%). If both optimizations are used together, the improvements increase to about 2.0% and 7.8%. Thus, the combination of two optimizations achieves a win-win effect. More ablation studies are provided in Appendix C.5. 5.7. Analysis of Uncertainty estimation Figure 4 represents density plots of the predicted differential entropy and mutual information2 . Lower entropy or mutual information represents the model yields a sharper distribution, indicating that the sample has low uncertainty. We also 5.5. 
Noisy data detection We finally evaluate our method on noisy examples. Noisy examples are generated by adding zero-mean isotropic Gaussian noise with standard deviation σ = 0.1 to the test data 2 8 Detailed formulas refer to Eq.(18) and Eq.(16) in Appendix B Uncertainty Estimation by Fisher Information-based Evidential Deep Learning tion, it cannot be directly applied to regression tasks. A naive approach is to discretize the regression task, but this is not a good solution because it loses information about continuous labels, such as order. However, we believe that avoiding overfitting caused by data uncertainty in regression tasks and how to apply our ideas to regression tasks is a very attractive problem worthy of exploration in future work. Acknowledgments We would like to thank Prof. Chang-Yu Hsieh for his helpful discussions on Evidential Deep Learning, and anonymous reviewers for their feedback and constructive comments to improve this work. This work is supported by the National Key R&D Program of China (2022YFE0200700), the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Reference Number: T45-401/22-N and 14201321), the National Natural Science Foundation of China (Project No. 62006219), the Natural Science Foundation of Guangdong Province (2022A1515011579), and the Hong Kong Innovation and Technology Fund (Project No. ITS/170/20 and ITS/241/21). Figure 4. Uncertainty representation for ID (MNIST) and OOD (FMNIST). More results are shown in Appendix C.6. report the energy distance (Székely & Rizzo, 2013) of two distributions (Formula is given in Appendix B.4.), which shows that our method provides more separable uncertainty estimates. More specifically, I-EDL produces sharper prediction peaks than EDL, both in the low uncertainty region of ID samples and the high uncertainty region of OOD samples. Furthermore, our method also reduces the occurrence of ID samples in high-uncertainty regions. References Ablain, M., Meyssignac, B., Zawadzki, L., Jugier, R., Ribes, A., Spada, G., Benveniste, J., Cazenave, A., and Picot, N. Uncertainty in satellite estimates of global mean sea-level changes, trend and acceleration. Earth System Science Data, 11(3):1189–1202, 2019. Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016. 6. Conclusion In this paper, we found that the classical EDL trained with mean square error would hinder the learning of evidence, especially for high data uncertainty samples. To address this issue, we propose a novel and simple method, Fisher Information-based Evidential Deep Learning (I-EDL) to alleviate the over-penalization of the mislabeled classes by considering importance weights with different classes for each sample. More specifically, we introduce the perspective of generative models to model evidential networks, where the observed label is jointly generated by the predicted categorical probability and the informativeness of each class contained in the sample. The categorical probabilities are generated from the Dirichlet distribution with its concentration parameter calculated by passing the input sample through the evidential network, while the information is obtained from FIM. 
Extensive experiments on various image classification, confidence evaluation and OOD detection tasks, as well as comparisons with some state-of-the-art algorithms, demonstrate the effectiveness of our approach in achieving high classification and uncertainty quantification. Since our method is designed based on Dirichlet distribu- Amini, A., Schwarting, W., Soleimany, A., and Rus, D. Deep evidential regression. Advances in Neural Information Processing Systems, 33:14927–14937, 2020. Bao, W., Yu, Q., and Kong, Y. Evidential deep learning for open set action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13349–13358, 2021. Bao, W., Yu, Q., and Kong, Y. Opental: Towards open set temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2979–2989, 2022. Bengs, V., Hüllermeier, E., and Waegeman, W. Pitfalls of epistemic uncertainty quantification through loss minimisation. In Advances in Neural Information Processing Systems, 2022. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International 9 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning conference on machine learning, pp. 1613–1622. PMLR, 2015. Graves, A. Practical variational inference for neural networks. Advances in neural information processing systems, 24, 2011. Charpentier, B., Zügner, D., and Günnemann, S. Posterior network: Uncertainty estimation without ood samples via density-based pseudo-counts. Advances in Neural Information Processing Systems, 33:1356–1367, 2020. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International conference on machine learning, pp. 1321–1330. PMLR, 2017. Choi, J., Chun, D., Kim, H., and Lee, H.-J. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 502–511, 2019. Hemmer, P., Kühl, N., and Schöffer, J. Deal: deep evidential active learning for image classification. In Deep Learning Applications, Volume 3, pp. 171–192. Springer, 2022. Jøsang, A. Artificial reasoning with subjective logic. In Proceedings of the second Australian workshop on commonsense reasoning, volume 48, pp. 34. Citeseer, 1997. Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718, 2018. Jøsang, A. Subjective logic, volume 3. Springer, 2016. Karandikar, A., Cain, N., Tran, D., Lakshminarayanan, B., Shlens, J., Mozer, M. C., and Roelofs, B. Soft calibration objectives for neural networks. Advances in Neural Information Processing Systems, 34:29768–29779, 2021. Collier, M., Mustafa, B., Kokiopoulou, E., Jenatton, R., and Berent, J. Correlated input-dependent label noise in large-scale image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1551–1560, 2021. Kristiadi, A., Hein, M., and Hennig, P. Learnable uncertainty under laplace approximations. In Uncertainty in Artificial Intelligence, pp. 344–353. PMLR, 2021. Cui, P., Yue, Y., Deng, Z., and Zhu, J. Confidence-based reliable learning under dual noises. Advances in Neural Information Processing Systems, 35:35116–35129, 2022. Kristiadi, A., Hein, M., and Hennig, P. Being a bit frequentist improves bayesian neural networks. 
In International Conference on Artificial Intelligence and Statistics, pp. 529–545. PMLR, 2022. Dempster, A. P. A generalization of bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 30(2):205–232, 1968. Feng, D., Rosenbaum, L., and Dietmayer, K. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st international conference on intelligent transportation systems (ITSC), pp. 3266–3273. IEEE, 2018. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. cs.toronto.edu, 2009. Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017. Gal, Y. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016. LeCun, Y. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998. Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. PMLR, 2016. Lehmann, E. L. and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006. Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018. Gal, Y., Islam, R., and Ghahramani, Z. Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. PMLR, 2017. Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. Pac-bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 353–360, 2009. Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient mcmc. Advances in neural information processing systems, 28, 2015. Ghaffari, S., Saleh, E., Forsyth, D., and Wang, Y.-X. On the importance of firth bias reduction in few-shot classification. arXiv preprint arXiv:2110.02529, 2021. Malinin, A. and Gales, M. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018. 10 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Malinin, A. and Gales, M. Reverse kl-divergence training of prior networks: Improved uncertainty and adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019. Ritter, H., Botev, A., and Barber, D. A scalable laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018Conference Track Proceedings, volume 6. International Conference on Representation Learning, 2018. Masegosa, A. Learning under model misspecification: Applications to variational and ensemble methods. Advances in Neural Information Processing Systems, 33: 5479–5491, 2020. Roelofs, R., Cain, N., Shlens, J., and Mozer, M. C. Mitigating bias in calibration error estimation. In International Conference on Artificial Intelligence and Statistics, pp. 4036–4054. PMLR, 2022. McAllester, D. A. Some pac-bayesian theorems. Machine Learning, 37(3):355–363, 1999. Sastry, C. S. and Oore, S. Detecting out-of-distribution examples with gram matrices. In International Conference on Machine Learning, pp. 8491–8501. PMLR, 2020. Meinert, N., Gawlikowski, J., and Lavin, A. The unreasonable effectiveness of deep evidential regression. 
arXiv preprint arXiv:2205.10060, 2022. Seeböck, P., Orlando, J. I., Schlegl, T., Waldstein, S. M., Bogunović, H., Klimscha, S., Langs, G., and SchmidtErfurth, U. Exploiting epistemic uncertainty of anatomy segmentation for anomaly detection in retinal oct. IEEE transactions on medical imaging, 39(1):87–98, 2019. Nair, T., Precup, D., Arnold, D. L., and Arbel, T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical image analysis, 59:101557, 2020. Sensoy, M., Kaplan, L., and Kandemir, M. Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems, 31, 2018. Nandy, J., Hsu, W., and Lee, M. L. Towards maximizing the representation gap between in-domain & out-ofdistribution examples. Advances in Neural Information Processing Systems, 33:9239–9250, 2020. Sentz, K. and Ferson, S. Combination of evidence in dempster-shafer theory. US Department of Energy, 2002. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. The street view house numbers (svhn) dataset. Technical report, Technical report, Accessed 2016-0801.[Online], 2018. Shafer, G. A mathematical theory of evidence, volume 42. Princeton university press, 1976. Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019. Soleimany, A. P., Amini, A., Goldman, S., Rus, D., Bhatia, S. N., and Coley, C. W. Evidential deep learning for guided molecular property prediction and discovery. ACS central science, 7(8):1356–1367, 2021. Pandey, D. S. and Yu, Q. Multidimensional belief quantification for label-efficient meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14391–14400, 2022. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016. Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017. Székely, G. J. and Rizzo, M. L. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013. Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. Mit Press, 2008. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016. Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018. Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The caltech-ucsd birds-200-2011 dataset. caltech.edu, 2011. 11 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688. 
Citeseer, 2011. Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–4708, 2020. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. Yang, S., Liu, L., and Xu, M. Free lunch for fewshot learning: Distribution calibration. arXiv preprint arXiv:2101.06395, 2021. Zaidi, S., Zela, A., Elsken, T., Holmes, C. C., Hutter, F., and Teh, Y. Neural ensemble search for uncertainty estimation and dataset shift. Advances in Neural Information Processing Systems, 34:7898–7911, 2021. Zhao, X., Chen, F., Hu, S., and Cho, J.-H. Uncertainty aware semi-supervised learning on graph data. Advances in Neural Information Processing Systems, 33:12827– 12836, 2020. 12 Uncertainty Estimation by Fisher Information-based Evidential Deep Learning A. Derivation and Proof This section provides the derivation of the Fisher Information Matrix (FIM) of Dirichlet distribution and the final objective function of Eq. 5. We also provide a brief overview of the proof of Theorem 3.1 from (Germain et al., 2009; Alquier et al., 2016; Masegosa, 2020). A.1. FIM Derivation for Dirichlet Distribution The Fisher Information Matrix (FIM) of Dirichlet distribution is defined as:   ∂ℓ ∂ℓ I(α) = E ∈ RK×K , ∂α ∂αT where ℓ = log Dir(p|α) is the log-likelihood function. Under weak conditions (see Lemma 5.3 in (Lehmann & Casella,  2006)), the FIM can be expressed as I(α) = EDir(p|α) −∂ 2 ℓ/∂ααT . Thus, we can calculate each element by   ∂2 [I(α)]ij = EDir(p|α) − log Dir(p|α)|α ∂αi ∂αj " !# K K X X ∂2 log Γ (α0 ) − log Γ (αk ) + (αk − 1) log pk = EDir(p|α) − ∂αi ∂αj k=1 k=1   ∂ = EDir(p|α) − (ψ (α0 ) − ψ (αi ) + log pi ) ∂αj ( ψ (1) (αi ) − ψ (1) (α0 ) , i = j, = −ψ (1) (α0 ) , i ̸= j, where Γ(·) is the gamma function, ψ(·) is the digamma function, ψ (1) (·) is the trigamma function, defined as ψ (1) (x) = dψ(x)/dx = d2 ln Γ(x)/dx2 . Then, we can get the matrix form of the FIM:  (1)  ψ (α1 ) − ψ (1) (α0 ) −ψ (1) (α0 ) ··· −ψ (1) (α0 )   −ψ (1) (α0 ) ψ (1) (α2 ) − ψ (1) (α0 ) · · · −ψ (1) (α0 )   I(α) =   .. .. . .. ..   . (6) . . −ψ (1) (α0 ) −ψ (1) (α0 ) ··· ψ (1) (αK ) − ψ (1) (α0 ) = diag([ψ (1) (α1 ), · · · , ψ (1) (αK )]) − ψ (1) (α0 )11T , where 1 = [1; · · · ; 1] ∈ RK . Let b = [ψ (1) (α1 ), · · · , ψ (1) (αK )]T , by applying Matrix-Determinant Lemma, we have |I(α)| = |diag(b)| · (1 − ψ (1) (α0 )1T diag(b)−1 1) = K Y ψ (1) (αi )(1 − i=1 K X ψ (1) (α0 ) j=1 ψ (1) (αj ) ). Therefore, log |I(α)| = K X log ψ (1) (αi ) + log(1 − i=1 K X ψ (1) (α0 ) i=1 ψ (1) (αi ) ). (7) A.2. Derivation of the objective function Eq. 5 Given training set D = {(xi , yi )}N i=1 , by applying Theorem 3.1, the upper bound of Eq.(3) can be expressed as     min E(x,y)∼D Ep∼Dir(α) − log p(y|p, α, σ 2 ) + λDKL (Dir(p|α̂)∥Dir(p|1)) θ s.t. α = fθ (x) + 1 α̂ = α ⊙ (1 − y) + y I(α) = diag([ψ (1) (α1 ), · · · , ψ (1) (αK )]) − ψ (1) (α0 )11T p ∼ Dir(α) p(y|p, α, σ 2 ) = N (y|p, σ 2 I(α)−1 ). 
13 (8) Uncertainty Estimation by Fisher Information-based Evidential Deep Learning We first simplify the first term EDir(p|α) [− log N (y|p, σ 2 I(α)−1 )] (abbreviated as E[− log N (y|p, σ 2 I(α)−1 )]), E[− log N (y|p, σ 2 I(α)−1 )]    1 1 T 2 −k 2 2 =E − log (2πσ ) |I(α)| exp(− 2 (y − p) I(α)(y − p)) 2σ   k 1 1 T 2 =E log(2πσ ) − log |I(α)| + 2 (y − p) I(α)(y − p) 2 2 2σ  k 1 1  = log(2πσ 2 ) − log |I(α)| + 2 E (y − p)T I(α)(y − p) 2 2 2σ (9)   Then, E (y − p)T I(α)(y − p) can be converted to:   E (y − p)T I(α)(y − p)   =y T I(α)y − 2y T I(α)E[p] + E pT I(α)p   =y T I(α)y − 2y T I(α)E[p] + E[p]T I(α)E[p] − E[p]T I(α)E[p] + E pT I(α)p   =(y − E[p])T I(α)(y − E[p]) + E (p − E[p])T I(α)(p − E[p]) (10) Since E[p] = α/α0 , Cov(pi , pj ) = (δij αi α0 − αi αj )/(α02 (α0 + 1)), where δij is the Kronecker delta (i.e. δij = 1 if i = j, else δij = 0), combined with the value of the FIM (Eq. 6), we have (y − E[p])T I(α)(y − E[p])  (1)  ψ (α1 ) − ψ (1) (α0 ) −ψ (1) (α0 ) ··· −ψ (1) (α0 )  −ψ (1) (α0 ) ψ (1) (α2 ) − ψ (1) (α0 ) · · · −ψ (1) (α0 ) α α T   =(y − )  )  (y − .. .. . . . . α0  α  0 . . . . −ψ (1) (α0 ) −ψ (1) (α0 ) · · · ψ (1) (αK ) − ψ (1) (α0 )   α T α =(y − ) diag([ψ (1) (α1 ), · · · , ψ (1) (αK )]) − ψ (1) (α0 )11T (y − ) α0 α0 K X αi 2 (1) ) ψ (αi ) , = (yi − α 0 i=1 (11) and   E (p − E[p])T I(α)(p − E[p]) = K X i,j=1 = = K X δij αi α0 − αi αj Cov(pi , pj )I(i, j) = i,j=1 α02 (α0 + 1) (δij ψ (1) (αi ) − ψ (1) (α0 )) K X K K X X αi α0 − αi2 (1) αi α0 αi αj (1) (1) ψ (α ) − ψ (α ) + (α0 ) i 0 2 (α + 1) 2 (α + 1) 2 (α + 1) ψ α α α 0 0 0 i=1 0 i=1 0 i,j=1 0 K X αi (α0 − αi ) α02 (α0 + 1) i=1 (12) ψ (1) (αi ) . Plugging (11) and (12) into (10), we have   E (y − p)T I(α)(y − p) =  K K  X X αi 2 (1) αi (α0 − αi ) (1) αi 2 αi (α0 − αi ) (yi − ) ψ (αi ) + 2 ψ (αi ) = (yi − ) + 2 ψ (1) (αi ) α α (α + 1) α α (α + 1) 0 0 0 0 0 0 i=1 i=1 14 (13) Uncertainty Estimation by Fisher Information-based Evidential Deep Learning Furthermore, for the KL term, we have DKL (Dir(p|α̂)∥Dir(p|1)) = log Γ( K X α̂k ) − log Γ(K) − k=1 K X log Γ(α̂k ) + k=1 K X  (α̂k − 1) ψ(α̂k ) − ψ( K X  (14) α̂j ) j=1 k=1 Plugging (7), (13) and (14) into (8), we can obtain the final objective of Eq. 5   E − log p(y|p, α, σ 2 ) + λDKL (Dir(p|α̂)∥Dir(p|1)) !  K K K  (1) X X X ψ (α ) αi 2 αi (α0 − αi ) 0 log ψ (1) (αi ) + log(1 − ) ) + 2 ψ (1) (αi ) −λ1 ∝ (yi − α α (α + 1) ψ (1) (αi ) 0 0 0 i=1 i=1 i=1 {z } | {z } | LI-MSE i  + λ2 log Γ( K X k=1 |I| Li α̂k ) − log Γ(K) − K X log Γ(α̂k ) + k=1 K X   K X (α̂k − 1) ψ(α̂k ) − ψ( α̂j ) . k=1 j=1 {z | LKL i } A.3. Proof of Theorem 3.1 Theorem 3.1 ((Germain et al., 2009; Alquier et al., 2016; Masegosa, 2020)). Given a data distribution P over X × Y, a hypothesis set θ, a prior distribution π over θ, for any δ ∈ (0, 1], and λ > 0, with probability at least 1 − δ over samples D ∼ P n , we have for all posterior ρ,   1 1 Eρ(θ) [L(θ)] ≤ Eρ(θ) [L̂(θ, D)] + DKL (ρ∥π) + log + ΨP,π (λ, n) λ δ h i where ΨP,π (λ, n) = log Eπ(θ) ED∼P n eλ(L(θ)−L̂(θ,D)) . Proof. 
A.3. Proof of Theorem 3.1

Theorem 3.1 ((Germain et al., 2009; Alquier et al., 2016; Masegosa, 2020)). Given a data distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, a hypothesis set $\Theta$, a prior distribution $\pi$ over $\Theta$, for any $\delta \in (0, 1]$ and $\lambda > 0$, with probability at least $1 - \delta$ over samples $\mathcal{D} \sim P^{n}$, we have for all posteriors $\rho$:
$$
\mathbb{E}_{\rho(\theta)}[\mathcal{L}(\theta)] \le \mathbb{E}_{\rho(\theta)}[\hat{\mathcal{L}}(\theta, \mathcal{D})] + \frac{1}{\lambda}\left[D_{\mathrm{KL}}(\rho \,\|\, \pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}(\lambda, n)\right],
$$
where $\Psi_{P,\pi}(\lambda, n) = \log \mathbb{E}_{\pi(\theta)}\mathbb{E}_{\mathcal{D}\sim P^{n}}\left[e^{\lambda(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D}))}\right]$.

Proof. The Donsker-Varadhan change of measure states that for any measurable function $\phi: \Theta \to \mathbb{R}$,
$$
\mathbb{E}_{\rho(\theta)}[\phi(\theta)] \le D_{\mathrm{KL}}(\rho \,\|\, \pi) + \log \mathbb{E}_{\pi(\theta)}\left[e^{\phi(\theta)}\right].
$$
Thus, with $\phi(\theta) := \lambda\left(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D})\right)$, we obtain for all $\rho$ on $\Theta$:
$$
\mathbb{E}_{\rho(\theta)}\left[\lambda\left(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D})\right)\right] = \lambda\left(\mathbb{E}_{\rho(\theta)}[\mathcal{L}(\theta)] - \mathbb{E}_{\rho(\theta)}[\hat{\mathcal{L}}(\theta, \mathcal{D})]\right) \le D_{\mathrm{KL}}(\rho \,\|\, \pi) + \log \mathbb{E}_{\pi(\theta)}\left[e^{\lambda(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D}))}\right].
$$
Next, we apply Markov's inequality to the random variable $\zeta_{\pi}(\mathcal{D}) := \mathbb{E}_{\pi(\theta)}\left[e^{\lambda(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D}))}\right]$:
$$
\Pr\left(\zeta_{\pi}(\mathcal{D}) \le \frac{1}{\delta}\,\mathbb{E}_{\mathcal{D}\sim P^{n}}[\zeta_{\pi}(\mathcal{D})]\right) \ge 1 - \delta.
$$
This implies that, with probability at least $1 - \delta$ over the choice of $\mathcal{D} \sim P^{n}$, we have for all $\rho$ on $\Theta$:
$$
\mathbb{E}_{\rho(\theta)}[\mathcal{L}(\theta)] \le \mathbb{E}_{\rho(\theta)}[\hat{\mathcal{L}}(\theta, \mathcal{D})] + \frac{1}{\lambda}\left[D_{\mathrm{KL}}(\rho \,\|\, \pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}(\lambda, n)\right],
$$
where $\Psi_{P,\pi}(\lambda, n) = \log \mathbb{E}_{\pi(\theta)}\mathbb{E}_{\mathcal{D}\sim P^{n}}\left[e^{\lambda(\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta, \mathcal{D}))}\right]$.

B. Derivations for Uncertainty Measures and Energy Distance

The Dirichlet distribution is parameterized by its concentration parameters $\alpha = [\alpha_{1}, \cdots, \alpha_{K}]$, $\alpha_{c} > 0$ for all $c$, and is defined as
$$
\mathrm{Dir}(p \mid \alpha) = \frac{\Gamma(\alpha_{0})}{\prod_{c=1}^{K}\Gamma(\alpha_{c})}\prod_{c=1}^{K}p_{c}^{\alpha_{c}-1}, \qquad \alpha_{0} = \sum_{c=1}^{K}\alpha_{c},
$$
where $p \in \Delta^{K-1}$ and $\Gamma(\cdot)$ is the gamma function. Note that the parameters $\alpha$ are obtained by passing the input sample through the evidential network ($\alpha = f_{\theta}(x) + 1 \in \mathbb{R}_{+}^{K}$). The following derivations are adapted from the appendix of (Malinin & Gales, 2018).

B.1. Expected Entropy of Dirichlet-based Uncertainty Models

The expected entropy is derived as follows:
$$
\begin{aligned}
\mathbb{E}_{p\sim\mathrm{Dir}(\alpha)}\left[\mathcal{H}[p(y \mid p)]\right]
&= \int_{\mathcal{S}^{K-1}}\mathrm{Dir}(p \mid \alpha)\left(-\sum_{c=1}^{K}p_{c}\ln p_{c}\right)dp \\
&= -\sum_{c=1}^{K}\int_{\mathcal{S}^{K-1}}\mathrm{Dir}(p \mid \alpha)\,p_{c}\ln p_{c}\,dp \\
&= -\sum_{c=1}^{K}\int_{\mathcal{S}^{K-1}}\frac{\Gamma(\alpha_{0})}{\prod_{k=1}^{K}\Gamma(\alpha_{k})}\prod_{k=1}^{K}p_{k}^{\alpha_{k}-1}\,p_{c}\ln p_{c}\,dp \\
&= -\sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\int_{\mathcal{S}^{K-1}}\frac{\Gamma(\alpha_{0}+1)}{\Gamma(\alpha_{c}+1)\prod_{k\neq c}\Gamma(\alpha_{k})}\,p_{c}^{\alpha_{c}}\prod_{k\neq c}p_{k}^{\alpha_{k}-1}\,\ln p_{c}\,dp \\
&= -\sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\mathbb{E}_{p\sim\mathrm{Dir}([\alpha_{1},\cdots,\alpha_{c-1},\alpha_{c}+1,\alpha_{c+1},\cdots,\alpha_{K}])}[\ln p_{c}] \\
&= -\sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\left(\psi(\alpha_{c}+1) - \psi(\alpha_{0}+1)\right). \qquad (15)
\end{aligned}
$$
The third-to-last equality uses the identity $\Gamma(x+1) = x\Gamma(x)$. Since the expected entropy captures the peakedness of the output distribution $p(y \mid p)$, it is used to measure data uncertainty. More specifically, low expected entropy means that the model concentrates all probability mass on one class, while high expected entropy indicates that the categorical distributions $p$ generated by $\mathrm{Dir}(\alpha)$ are close to uniform.

B.2. Mutual Information of Dirichlet-based Uncertainty Models

In DBU models, the mutual information between the labels $y$ and the categorical distribution $p$ can be computed as the difference between the entropy of the expected distribution and the expected entropy of the distribution, which can be viewed as the difference between the total uncertainty and the data uncertainty:
$$
\underbrace{\mathcal{I}[y, p \mid x, \mathcal{D}]}_{\text{Distributional Uncertainty}} = \underbrace{\mathcal{H}\left[\mathbb{E}_{p(p \mid x, \mathcal{D})}[p(y \mid p)]\right]}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(p \mid x, \mathcal{D})}\left[\mathcal{H}[p(y \mid p)]\right]}_{\text{Expected Data Uncertainty}}. \qquad (16)
$$
Assuming that the point estimate $p(\theta \mid \mathcal{D}) = \delta(\theta - \hat{\theta}) \Rightarrow p(p \mid x, \mathcal{D}) \approx p(p \mid x, \hat{\theta}) = \mathrm{Dir}(p \mid \alpha)$ is sufficient given appropriate regularization and training data size, the mutual information simplifies to
$$
\begin{aligned}
\mathcal{I}[y, p \mid x, \mathcal{D}]
&\approx \mathcal{H}\left[\mathbb{E}_{p\sim\mathrm{Dir}(\alpha)}[p(y \mid p)]\right] - \mathbb{E}_{p\sim\mathrm{Dir}(\alpha)}\left[\mathcal{H}[p(y \mid p)]\right] \\
&= -\sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\ln\frac{\alpha_{c}}{\alpha_{0}} + \sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\left(\psi(\alpha_{c}+1) - \psi(\alpha_{0}+1)\right) \\
&= -\sum_{c=1}^{K}\frac{\alpha_{c}}{\alpha_{0}}\left(\ln\frac{\alpha_{c}}{\alpha_{0}} - \psi(\alpha_{c}+1) + \psi(\alpha_{0}+1)\right). \qquad (17)
\end{aligned}
$$
The second term in this derivation follows from the result in B.1. Mutual information is often used to measure distributional uncertainty, as it captures how uniform the expected output distribution is after excluding data uncertainty. High mutual information indicates a uniform expected categorical distribution together with low data uncertainty.
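Both measures derived in B.1 and B.2 reduce to simple closed-form functions of the concentration parameters. Below is a minimal NumPy/SciPy sketch (ours, for illustration only) of Eqs. (15) and (17).

```python
import numpy as np
from scipy.special import digamma

def expected_entropy(alpha):
    """Expected entropy E_{p ~ Dir(alpha)}[H[p(y|p)]], Eq. (15): a data-uncertainty measure."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    return -np.sum(alpha / alpha0 * (digamma(alpha + 1.0) - digamma(alpha0 + 1.0)))

def mutual_information(alpha):
    """Mutual information I[y, p | x, D], Eq. (17): a distributional-uncertainty measure."""
    alpha = np.asarray(alpha, dtype=float)
    p_mean = alpha / alpha.sum()
    total_uncertainty = -np.sum(p_mean * np.log(p_mean))  # entropy of the expected distribution
    return total_uncertainty - expected_entropy(alpha)    # total minus expected data uncertainty

# A flat Dirichlet with little evidence gives high mutual information (distributional
# uncertainty), while a Dirichlet concentrated on one class gives low values for both measures.
print(mutual_information([1.1, 1.1, 1.1]), mutual_information([50.0, 1.0, 1.0]))
```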
B.3. Differential Entropy of Dirichlet-based Uncertainty Models

The differential entropy is derived as follows:
$$
\begin{aligned}
\mathcal{H}\left[\mathrm{Dir}(p \mid \alpha)\right]
&= -\int_{\mathcal{S}^{K-1}}\mathrm{Dir}(p \mid \alpha)\ln\mathrm{Dir}(p \mid \alpha)\,dp \\
&= -\int_{\mathcal{S}^{K-1}}\mathrm{Dir}(p \mid \alpha)\left(\ln\Gamma(\alpha_{0}) - \sum_{c=1}^{K}\ln\Gamma(\alpha_{c}) + \sum_{c=1}^{K}(\alpha_{c}-1)\ln p_{c}\right)dp \\
&= \sum_{c=1}^{K}\ln\Gamma(\alpha_{c}) - \ln\Gamma(\alpha_{0}) - \sum_{c=1}^{K}(\alpha_{c}-1)\,\mathbb{E}_{p\sim\mathrm{Dir}(\alpha)}[\ln p_{c}] \\
&= \sum_{c=1}^{K}\ln\Gamma(\alpha_{c}) - \ln\Gamma(\alpha_{0}) - \sum_{c=1}^{K}(\alpha_{c}-1)\left(\psi(\alpha_{c}) - \psi(\alpha_{0})\right). \qquad (18)
\end{aligned}
$$
The last equality uses $\mathbb{E}_{p\sim\mathrm{Dir}(\alpha)}[\ln p_{c}] = \psi(\alpha_{c}) - \psi(\alpha_{0})$. Low differential entropy indicates that the model yields a sharp Dirichlet distribution, while high differential entropy denotes a more uniform one. Thus, differential entropy is also a common measure of distributional uncertainty.

B.4. Energy Distance

The energy distance is a metric that measures the distance between the distributions of random vectors. Let $X$ and $Y$ be independent random vectors in $\mathbb{R}^{d}$ with cumulative distribution functions (CDFs) $F$ and $G$, respectively. The energy distance can be defined in terms of expected distances between the random vectors,
$$
D(F, G) = \left(2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|\right)^{1/2}, \qquad (19)
$$
where $X'$ (resp. $Y'$) is an independent copy of $X$ (resp. $Y$), i.e., an independent random vector distributed according to $F$ (resp. $G$), and $\|\cdot\|$ denotes the Euclidean norm. The energy distance is zero if and only if the two distributions are identical.
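For reference, both quantities above can be computed directly. The sketch below (ours, illustrative only) evaluates the differential entropy of Eq. (18) from the concentration parameters and gives a simple plug-in estimate of the energy distance of Eq. (19) from two sets of samples, such as the normalized uncertainty scores of in-distribution and OOD data used in Appendix C.6.

```python
import numpy as np
from scipy.special import gammaln, digamma

def differential_entropy(alpha):
    """Differential entropy H[Dir(p|alpha)], Eq. (18)."""
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    return (gammaln(alpha).sum() - gammaln(alpha0)
            - np.sum((alpha - 1.0) * (digamma(alpha) - digamma(alpha0))))

def energy_distance(x, y):
    """Plug-in estimate of the energy distance D(F, G) in Eq. (19) from samples.

    x: (n, d) or (n,) samples from F; y: (m, d) or (m,) samples from G.
    The pairwise means include the zero self-distances, so the estimate is
    slightly biased for small sample sizes.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    d_xy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).mean()
    d_xx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1).mean()
    d_yy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1).mean()
    return np.sqrt(max(2.0 * d_xy - d_xx - d_yy, 0.0))
```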
C. Experimental Details and Additional Results

C.1. Datasets

MNIST (LeCun, 1998) is a database of handwritten digits 0 to 9, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each input is a 1 × 28 × 28 tensor. We use an (80%, 20%) split of the training samples for training and validation. For the OOD detection experiments, we use KMNIST (Clanuwat et al., 2018) and FashionMNIST (Xiao et al., 2017), which contain images of Japanese characters and images of clothes, respectively.

CIFAR10 (Krizhevsky et al., 2009) consists of 60,000 images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 50,000 training images and 10,000 test images. Each input is a 3 × 32 × 32 tensor. We use a (95%, 5%) split of the training samples for training and validation. The Street View House Numbers (SVHN) dataset (Netzer et al., 2018), a dataset of digit images, and CIFAR100 are used for OOD detection.

mini-ImageNet (Vinyals et al., 2016) was proposed for few-shot learning evaluation. It contains 100 classes with 600 samples of 84 × 84 color images per class. These 100 classes are divided into 64, 16, and 20 classes for sampling tasks for meta-training, meta-validation, and meta-testing, respectively. For the few-shot classification task, we evaluate N-way K-shot classification for N ∈ {5, 10} and K ∈ {1, 5, 20, 50} and report the average top-1 accuracy (%) and 95% confidence interval over 10,000 few-shot episodes on the meta-test set. Each episode contains N randomly sampled classes with K samples per class for adaptation and min(15, K) query samples per class for evaluation. For the OOD detection experiments, we use the Caltech-UCSD Birds (CUB) dataset (Wah et al., 2011), which contains 11,788 images of 200 bird subcategories. Note that we randomly sample the same number of OOD examples as query samples for OOD detection.

tiered-ImageNet (Ren et al., 2018) is a larger subset of ILSVRC-12. Compared with the 100 classes of mini-ImageNet, it contains 608 classes (779,165 images), which are grouped into 34 higher-level nodes in the ImageNet human-curated hierarchy. The nodes are divided into 20, 6, and 8 disjoint sets of training, validation, and testing nodes, with the corresponding classes forming their respective meta-sets. We conduct the few-shot classification and OOD detection (against CUB) experiments on this dataset using the same settings as for mini-ImageNet.

C.2. Implementation details

For the MNIST and CIFAR10 datasets, our method is implemented by adapting the code provided by (Charpentier et al., 2020). Following (Charpentier et al., 2020), we use a network with 3 convolutional layers and 3 dense layers for MNIST and VGG16 (Simonyan & Zisserman, 2014) for CIFAR10. Softplus is used in the last layer to obtain non-negative outputs. We train for up to 200 epochs with a batch size of 64, using validation-loss-based early stopping. The learning rate is set to 0.001 for MNIST and FMNIST and to 0.0005 for CIFAR10. The coefficient λ of −|I| is set by grid search over (0.1, 0.05, 0.01, 0.005, 0.001); the selected values are 0.005 for MNIST, 0.01 for FMNIST, and 0.05 for CIFAR10.

For the mini-ImageNet and tiered-ImageNet few-shot classification experiments, our method is implemented by adapting the code provided by (Ghaffari et al., 2021). More specifically, we use the WideResNet-28-10 backbone pre-trained by (Yang et al., 2021) as the feature extractor and train a 1-layer classifier on top of it. Softplus is used as the activation function to obtain non-negative outputs. The coefficient λ is also set by grid search on the meta-validation set. Table 6 reports the selected hyperparameters for the few-shot settings, and Figure 6 shows the impact of λ under the 5-way mini-ImageNet setting.

Table 6. List of hyperparameters for our approach.

| Setting | 5-way 1-shot | 5-way 5-shot | 5-way 20-shot | 5-way 50-shot | 10-way 1-shot | 10-way 5-shot | 10-way 20-shot | 10-way 50-shot |
|---------|--------------|--------------|---------------|---------------|---------------|---------------|----------------|----------------|
| λ       | 0.01         | 0.05         | 0.005         | 0.01          | 0.1           | 0.01          | 0.01           | 0.01           |
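To make the output parameterization described above concrete, the following is a minimal, hypothetical sketch of an evidential classification head: a linear layer followed by Softplus produces the non-negative evidence, and α = f_θ(x) + 1. The `backbone` placeholder, feature dimension, and way count are illustrative and are not the exact configurations used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Linear classifier head whose Softplus output is treated as class evidence."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, features):
        evidence = F.softplus(self.fc(features))  # non-negative evidence
        return evidence + 1.0                     # Dirichlet concentration parameters alpha

# Illustrative usage: an arbitrary feature extractor producing 640-dimensional features.
backbone = nn.Identity()                 # placeholder for the real feature extractor
head = EvidentialHead(640, 5)            # e.g., a 5-way few-shot episode
alpha = head(backbone(torch.randn(4, 640)))
alpha0 = alpha.sum(dim=-1)               # precision alpha_0, one of the uncertainty scores
```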
C.3. Additional Experimental Results on OOD Detection

Tables 7 and 8 display the AUPR and AUROC scores of OOD detection on CIFAR10 against CIFAR100 and SVHN, and on MNIST against KMNIST and FMNIST. Table 9 compares our method with label smoothing in the few-shot setting. We use maximum probability (Max.P) and entropy (Ent.) to measure uncertainty for label smoothing, and Max.P and α0 for I-EDL. All of these results consistently demonstrate the superior OOD detection performance of our proposed method.

C.4. Additional Experimental Results on Few-shot Learning

Table 10 shows the AUROC scores of OOD detection under the {5, 10}-way {1, 5, 20, 50}-shot settings of mini-ImageNet. For all N-way K-shot tasks, the AUROC scores based on Max.P, α0, D.Ent., and M.I. show impressive improvements of I-EDL over EDL. For example, the improvements in OOD detection after using I-EDL are all above 3.37%, and the improvement of Max.P-based OOD detection reaches 12.98% under the 10-way 1-shot setting. The experimental results on tiered-ImageNet are shown in Figure 5. We observe the same trends as on mini-ImageNet, which indicates that our method improves OOD detection performance, especially in the more challenging few-shot settings.

C.5. Additional Ablation Study

Figure 6 shows the effect of the coefficient λ of the negative log-determinant of the FIM (−|I|) on the mini-ImageNet classification and OOD detection experiments. We plot the results under the 5-way {5, 20, 50}-shot settings. It can be observed that the best coefficients for OOD detection based on different uncertainty measures are consistent with each other, but not with the best coefficient for accuracy. More specifically, the best coefficients for OOD detection are all around 0.01, whereas accuracy prefers 0.001. This is a non-trivial issue because it amounts to a multi-objective optimization problem. In this work, the coefficient is ultimately a compromise that balances image classification and OOD detection performance. Whether there is a better way to optimize both objectives simultaneously remains to be explored in future work.

Table 7. OOD detection results on (ID) CIFAR10 against (OOD) CIFAR100. Each experiment is run with 5 seeds.

| Method | AUPR Max.P | AUPR α0 | AUPR D.Ent. | AUPR M.I. | AUROC Max.P | AUROC α0 | AUROC D.Ent. | AUROC M.I. |
|--------|------------|---------|-------------|-----------|-------------|----------|--------------|------------|
| KL-PN  | 61.41 ± 2.8 | 61.53 ± 3.4 | 60.21 ± 3.2 | 61.66 ± 3.4 | 57.89 ± 1.8 | 58.43 ± 2.6 | 55.94 ± 3.3 | 58.53 ± 2.6 |
| RKL-PN | 55.42 ± 2.6 | 54.74 ± 2.8 | 55.40 ± 2.9 | 54.86 ± 2.9 | 54.24 ± 2.4 | 53.25 ± 2.9 | 54.15 ± 3.0 | 53.41 ± 3.0 |
| PostN  | 81.96 ± 0.8 | 82.06 ± 0.8 | 82.34 ± 0.8 | 78.64 ± 1.7 | 80.49 ± 0.9 | 81.17 ± 1.1 | 81.51 ± 1.0 | 80.22 ± 1.0 |
| EDL    | 84.30 ± 0.7 | 84.18 ± 0.7 | 84.32 ± 0.7 | 84.19 ± 0.7 | 80.96 ± 0.8 | 80.63 ± 1.0 | 80.99 ± 0.8 | 80.65 ± 1.0 |
| I-EDL  | 85.35 ± 0.7 | 84.84 ± 0.6 | 85.40 ± 0.6 | 84.95 ± 0.7 | 83.55 ± 0.7 | 82.15 ± 0.5 | 83.69 ± 0.7 | 82.44 ± 0.5 |

Table 8. OOD detection results on (ID) MNIST against (OOD) KMNIST and FMNIST, and (ID) CIFAR10 against (OOD) SVHN. Each experiment is run with 5 seeds.

MNIST → KMNIST
| Method | AUPR Max.P | AUPR α0 | AUPR D.Ent. | AUPR M.I. | AUROC Max.P | AUROC α0 | AUROC D.Ent. | AUROC M.I. |
|--------|------------|---------|-------------|-----------|-------------|----------|--------------|------------|
| EDL    | 97.02 ± 0.7 | 96.31 ± 0.2 | 96.92 ± 0.9 | 96.41 ± 1.8 | 96.59 ± 0.6 | 96.18 ± 1.3 | 96.49 ± 0.8 | 96.22 ± 1.3 |
| I-EDL  | 98.34 ± 0.2 | 98.33 ± 0.2 | 98.34 ± 0.2 | 98.33 ± 0.2 | 98.00 ± 0.3 | 97.97 ± 0.3 | 97.99 ± 0.3 | 97.97 ± 0.3 |

MNIST → FMNIST
| EDL    | 98.10 ± 0.4 | 98.08 ± 0.4 | 98.10 ± 0.4 | 98.09 ± 0.4 | 97.39 ± 0.6 | 97.40 ± 0.5 | 97.48 ± 0.5 | 97.43 ± 0.5 |
| I-EDL  | 98.89 ± 0.3 | 98.86 ± 0.3 | 98.89 ± 0.3 | 98.87 ± 0.2 | 98.49 ± 0.3 | 98.41 ± 0.4 | 98.48 ± 0.4 | 98.42 ± 0.4 |

CIFAR10 → SVHN
| EDL    | 78.87 ± 3.5 | 79.12 ± 3.7 | 78.91 ± 3.5 | 79.11 ± 3.7 | 80.64 ± 4.2 | 81.06 ± 4.5 | 80.71 ± 4.3 | 81.05 ± 4.5 |
| I-EDL  | 83.26 ± 2.4 | 82.96 ± 2.2 | 83.31 ± 2.5 | 83.06 ± 2.2 | 87.58 ± 2.0 | 86.79 ± 1.3 | 87.69 ± 2.1 | 87.01 ± 1.5 |

Table 9. Comparison of label smoothing and I-EDL on OOD detection AUPR of mini-ImageNet against CUB under 5-way {1, 5, 20, 50}-shot settings. Each experiment is run over 10,000 few-shot episodes.

| Method | 1-shot Max.P | 1-shot Ent.-α0 | 5-shot Max.P | 5-shot Ent.-α0 | 20-shot Max.P | 20-shot Ent.-α0 | 50-shot Max.P | 50-shot Ent.-α0 |
|--------|--------------|----------------|--------------|----------------|---------------|-----------------|---------------|-----------------|
| Label smoothing | 72.03 ± 0.2 | 73.00 ± 0.2 | 77.17 ± 0.2 | 77.11 ± 0.2 | 76.11 ± 0.2 | 75.35 ± 0.2 | 74.76 ± 0.2 | 73.86 ± 0.2 |
| I-EDL  | 71.95 ± 0.2 | 74.79 ± 0.2 | 82.04 ± 0.2 | 82.49 ± 0.2 | 84.31 ± 0.2 | 85.42 ± 0.2 | 84.68 ± 0.2 | 84.91 ± 0.2 |
| ∆      | −0.08 | 1.79 | 4.87 | 5.38 | 8.20 | 10.07 | 9.92 | 11.05 |

C.6. Additional Analysis of Uncertainty Estimation

Figures 7 and 8 show density plots of the normalized uncertainty measures for MNIST vs. FMNIST and MNIST vs. KMNIST, respectively. The uncertainty measures include the precision α0, max_c p_c, the differential entropy, and the mutual information. We normalize each uncertainty value u_c by u_c = (u_c − min_i u_i)/(max_i u_i − min_i u_i). We also report the energy distance between the two distributions, with higher values indicating better separability. It can be observed that I-EDL produces sharper prediction peaks than EDL in the in-distribution (MNIST) region. Although our method does not use OOD data, it also makes the uncertainty of OOD data more concentrated.
Table 10. AUROC scores of OOD detection against CUB under {5, 10}-way and {1, 5, 20, 50}-shot settings. Each experiment is run over 10,000 few-shot episodes.

1-shot
| Method | 5-way Max.P | 5-way α0 | 5-way D.Ent. | 5-way M.I. | 10-way Max.P | 10-way α0 | 10-way D.Ent. | 10-way M.I. |
|--------|-------------|----------|--------------|------------|--------------|-----------|---------------|-------------|
| EDL    | 61.88 ± 0.27 | 59.72 ± 0.31 | 63.60 ± 0.31 | 60.42 ± 0.30 | 55.83 ± 0.22 | 63.02 ± 0.29 | 63.06 ± 0.27 | 63.05 ± 0.29 |
| I-EDL  | 67.46 ± 0.28 | 70.08 ± 0.30 | 69.51 ± 0.30 | 70.03 ± 0.30 | 68.81 ± 0.22 | 69.29 ± 0.22 | 68.80 ± 0.22 | 69.29 ± 0.22 |
| ∆      | 5.58 | 10.36 | 5.91 | 9.61 | 12.98 | 6.27 | 5.74 | 6.24 |

5-shot
| EDL    | 69.71 ± 0.25 | 72.01 ± 0.33 | 72.97 ± 0.29 | 72.14 ± 0.32 | 68.59 ± 0.23 | 73.23 ± 0.23 | 72.70 ± 0.23 | 73.16 ± 0.23 |
| I-EDL  | 79.33 ± 0.22 | 79.81 ± 0.22 | 79.60 ± 0.22 | 79.79 ± 0.22 | 79.34 ± 0.23 | 80.29 ± 0.22 | 78.36 ± 0.20 | 79.91 ± 0.21 |
| ∆      | 9.62 | 7.80 | 6.63 | 7.65 | 10.75 | 7.06 | 5.66 | 6.75 |

20-shot
| EDL    | 76.59 ± 0.24 | 76.16 ± 0.29 | 76.75 ± 0.27 | 76.18 ± 0.29 | 71.69 ± 0.19 | 74.08 ± 0.20 | 73.88 ± 0.19 | 74.05 ± 0.20 |
| I-EDL  | 82.04 ± 0.21 | 83.38 ± 0.22 | 83.00 ± 0.21 | 83.29 ± 0.21 | 79.66 ± 0.16 | 80.74 ± 0.16 | 80.07 ± 0.15 | 80.61 ± 0.16 |
| ∆      | 5.45 | 7.22 | 6.25 | 7.11 | 7.96 | 6.66 | 6.19 | 6.56 |

50-shot
| EDL    | 79.30 ± 0.20 | 78.74 ± 0.24 | 79.21 ± 0.22 | 78.76 ± 0.24 | 73.67 ± 0.17 | 74.43 ± 0.18 | 74.32 ± 0.17 | 74.40 ± 0.18 |
| I-EDL  | 82.55 ± 0.17 | 83.17 ± 0.19 | 83.29 ± 0.18 | 83.20 ± 0.19 | 77.39 ± 0.15 | 77.80 ± 0.16 | 77.72 ± 0.15 | 77.78 ± 0.16 |
| ∆      | 3.25 | 4.43 | 4.08 | 4.44 | 3.72 | 3.37 | 3.40 | 3.38 |

Figure 5. Results on tiered-ImageNet {5, 10}-way {1, 5, 20, 50}-shot classification (Accuracy) and OOD detection against CUB (AUPR). The error bars are almost non-existent (i.e., less than 0.03%), since over 10,000 trials were performed for each point. I-EDL produces statistically significant improvements over a wide range of the number of shots.

Figure 6. The impact of the coefficient λ of the negative log-determinant of the FIM (−|I|) under 5-way mini-ImageNet classification and OOD detection experiments.

Figure 7. Uncertainty representation for ID (MNIST) and OOD (FMNIST).

Figure 8. Uncertainty representation for ID (MNIST) and OOD (KMNIST).