Evidential Deep Learning to Quantify Classification Uncertainty

This is a PyTorch implementation of the paper Evidential Deep Learning to Quantify Classification Uncertainty.

Dempster-Shafer Theory of Evidence assigns belief masses to sets of classes (unlike a probability, which is assigned to a single class). The sum of the masses of all subsets is $1$. Individual class probabilities (plausibilities) can be derived from these masses.

Assigning a mass to the set of all classes means it can be any one of the classes; i.e. saying "I don't know".

If there are $K$ classes, we assign masses $b_{k} \geq 0$ to each of the classes and an overall uncertainty mass $u \geq 0$ to all classes.

$u + \sum_{k=1}^{K} b_k = 1$

Belief masses $b_k$ and $u$ can be computed from evidence $e_k \geq 0$, as $b_k = \frac{e_k}{S}$ and $u = \frac{K}{S}$, where $S = \sum_{k=1}^{K} (e_k + 1)$. The paper uses the term evidence for a measure of the amount of support collected from data in favor of classifying a sample into a certain class.
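As a small worked example of these formulas, the sketch below computes the belief masses and the uncertainty mass for one sample. The function name `belief_and_uncertainty` and the evidence values are made up for illustration and are not part of the original implementation.

```python
import torch


def belief_and_uncertainty(evidence: torch.Tensor):
    """Hypothetical helper: returns (b, u) for evidence of shape [n_classes]."""
    n_classes = evidence.shape[-1]
    # S = sum_k (e_k + 1)
    strength = (evidence + 1.).sum(dim=-1)
    # b_k = e_k / S
    belief = evidence / strength
    # u = K / S
    uncertainty = n_classes / strength
    return belief, uncertainty


# Illustrative evidence for K = 3 classes, so S = 5 + 2 + 1 = 8
belief, uncertainty = belief_and_uncertainty(torch.tensor([4., 1., 0.]))
# With no evidence at all, the whole mass goes to "I don't know"
_, no_evidence_uncertainty = belief_and_uncertainty(torch.zeros(3))
print(belief, uncertainty, no_evidence_uncertainty)
```

Here the belief masses are $4/8$, $1/8$ and $0$, the uncertainty mass is $3/8$, and together they sum to $1$.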

This corresponds to a Dirichlet distribution with parameters $\alpha_k = e_k + 1$, and $\alpha_0 = S = \sum_{k=1}^{K} \alpha_k$ is known as the Dirichlet strength. The Dirichlet distribution $D(p \mid \alpha)$ is a distribution over categorical distributions; i.e. you can sample class probabilities from a Dirichlet distribution. The expected probability for class $k$ is $\hat{p}_k = \frac{\alpha_k}{S}$.
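To make "sampling class probabilities" concrete, here is a minimal sketch using `torch.distributions.Dirichlet`. The parameter values, the seed and the sample count are arbitrary choices for illustration.

```python
import torch
from torch.distributions import Dirichlet

torch.manual_seed(0)

# Illustrative Dirichlet parameters alpha_k = e_k + 1 for K = 3 classes
alpha = torch.tensor([5., 2., 1.])
dirichlet = Dirichlet(alpha)

# Each sample is a categorical distribution: K non-negative values summing to 1
samples = dirichlet.sample((10000,))

# Expected probability \hat{p}_k = alpha_k / S
expected = alpha / alpha.sum()
# The empirical mean of the samples should be close to the expected probability
empirical = samples.mean(dim=0)
print(expected, empirical)
```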

We get the model to output evidence $e = \alpha - 1 = f(x \mid \Theta)$ for a given input $x$. We use a function such as ReLU or Softplus at the final layer to get $f(x \mid \Theta) \geq 0$.
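A final layer of this kind could look like the sketch below. `EvidenceHead`, its layer sizes and the random input are hypothetical and only illustrate the non-negativity constraint; the model used in the actual experiment may differ.

```python
import torch
from torch import nn


class EvidenceHead(nn.Module):
    """Hypothetical output head that maps features to non-negative evidence."""

    def __init__(self, in_features: int, n_classes: int):
        super().__init__()
        self.linear = nn.Linear(in_features, n_classes)
        # Softplus keeps the outputs non-negative, so f(x | Theta) >= 0
        self.activation = nn.Softplus()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.linear(x))


torch.manual_seed(0)
head = EvidenceHead(in_features=16, n_classes=10)
evidence = head(torch.randn(4, 16))
# Dirichlet parameters alpha = e + 1 are therefore at least 1
alpha = evidence + 1.
```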

The paper proposes a few loss functions to train the model, which we have implemented below.

The training code in experiment.py trains a model on the MNIST dataset.

import torch

from labml import tracker
from labml_helpers.module import Module

Type II Maximum Likelihood Loss

The distribution $D(p \mid \alpha)$ is a prior on the likelihood $\mathrm{Multi}(y \mid p)$, and the negative log marginal likelihood is calculated by integrating over class probabilities $p$.

If target probabilities (one-hot targets) are $y_{k}$ for a given sample the loss is,

$$\mathcal{L}(\Theta) = -\log \Bigg( \int \prod_{k=1}^{K} p_k^{y_k} \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp \Bigg) = \sum_{k=1}^{K} y_k (\log S - \log \alpha_k)$$
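The closed form can be checked in one step. As a worked sketch, let $c$ denote the index of the true class (a symbol introduced here for illustration), so that $y_c = 1$ and all other $y_k = 0$. The integral is then the mean of $p_c$ under the Dirichlet distribution:

```latex
\begin{align}
\int \prod_{k=1}^{K} p_k^{y_k} \, \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp
  &= \mathbb{E}_{p \sim D(p \mid \alpha)} [p_c] = \frac{\alpha_c}{S} \\
\mathcal{L}(\Theta) &= -\log \frac{\alpha_c}{S} = \log S - \log \alpha_c
  = \sum_{k=1}^{K} y_k (\log S - \log \alpha_k)
\end{align}
```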

class MaximumLikelihoodLoss(Module):

evidence is $e \geq 0$ with shape [batch_size, n_classes]
target is $y$ with shape [batch_size, n_classes]

    def forward(self, evidence: torch.Tensor, target: torch.Tensor):

$\alpha_k = e_k + 1$

        alpha = evidence + 1.

$S = \sum_{k=1}^{K} \alpha_k$

        strength = alpha.sum(dim=-1)

Losses $\mathcal{L}(\Theta) = \sum_{k=1}^{K} y_k (\log S - \log \alpha_k)$

        loss = (target * (strength.log()[:, None] - alpha.log())).sum(dim=-1)

Mean loss over the batch

        return loss.mean()
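The following sketch repeats the same computation as a plain function so it can be tried without `labml_helpers`. The function name and the evidence values are illustrative assumptions.

```python
import math

import torch


def maximum_likelihood_loss(evidence: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Stand-alone version of the computation above (hypothetical helper)."""
    alpha = evidence + 1.
    strength = alpha.sum(dim=-1)
    loss = (target * (strength.log()[:, None] - alpha.log())).sum(dim=-1)
    return loss.mean()


target = torch.tensor([[1., 0., 0.]])
# Evidence supports the true class: alpha = [10, 1, 1], S = 12
loss_right = maximum_likelihood_loss(torch.tensor([[9., 0., 0.]]), target)
# Same amount of evidence, but for a wrong class: alpha = [1, 10, 1], S = 12
loss_wrong = maximum_likelihood_loss(torch.tensor([[0., 9., 0.]]), target)
print(loss_right.item(), loss_wrong.item())
```

The first loss is $\log 12 - \log 10$ and the second is $\log 12 - \log 1$, so evidence for a wrong class is penalized more heavily.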

Bayes Risk with Cross Entropy Loss

Bayes risk is the expected cost of making incorrect estimates. It takes a cost function that gives the cost of making an incorrect estimate and integrates it over all possible outcomes, weighted by their probability.

Here the cost function is cross-entropy loss, for one-hot coded $y$

$$\sum_{k=1}^{K} -y_k \log p_k$$

We integrate this cost over all $p$

$$\mathcal{L}(\Theta) = \int \Bigg[ \sum_{k=1}^{K} -y_k \log p_k \Bigg] \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp = \sum_{k=1}^{K} y_k (\psi(S) - \psi(\alpha_k))$$

where $\psi(\cdot)$ is the digamma function.

class CrossEntropyBayesRisk(Module):

evidence is $e \geq 0$ with shape [batch_size, n_classes]
target is $y$ with shape [batch_size, n_classes]

    def forward(self, evidence: torch.Tensor, target: torch.Tensor):

$\alpha_k = e_k + 1$

        alpha = evidence + 1.

$S = \sum_{k=1}^{K} \alpha_k$

        strength = alpha.sum(dim=-1)

Losses $\mathcal{L}(\Theta) = \sum_{k=1}^{K} y_k (\psi(S) - \psi(\alpha_k))$

        loss = (target * (torch.digamma(strength)[:, None] - torch.digamma(alpha))).sum(dim=-1)

Mean loss over the batch

        return loss.mean()
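As a quick sanity check, the sketch below evaluates this loss next to the Type II maximum likelihood loss on the same inputs. Both function names and the evidence values are illustrative assumptions. By Jensen's inequality the expected cross entropy is never smaller than the negative log of the expected probability, so the Bayes risk should come out at least as large.

```python
import torch


def cross_entropy_bayes_risk(evidence: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Stand-alone version of the computation above (hypothetical helper)."""
    alpha = evidence + 1.
    strength = alpha.sum(dim=-1)
    loss = (target * (torch.digamma(strength)[:, None] - torch.digamma(alpha))).sum(dim=-1)
    return loss.mean()


def maximum_likelihood_loss(evidence: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Type II maximum likelihood loss, for comparison (hypothetical helper)."""
    alpha = evidence + 1.
    strength = alpha.sum(dim=-1)
    loss = (target * (strength.log()[:, None] - alpha.log())).sum(dim=-1)
    return loss.mean()


# Two illustrative samples: one confident and correct, one wrong
evidence = torch.tensor([[9., 0., 0.], [1., 2., 0.]])
target = torch.tensor([[1., 0., 0.], [0., 0., 1.]])

ce_risk = cross_entropy_bayes_risk(evidence, target)
ml_loss = maximum_likelihood_loss(evidence, target)
print(ce_risk.item(), ml_loss.item())
```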

Bayes Risk with Squared Error Loss

Here the cost function is squared error,

$$\sum_{k=1}^{K} (y_k - p_k)^2 = \Vert y - p \Vert_2^2$$

We integrate this cost over all $p$

$$\begin{aligned} \mathcal{L}(\Theta) &= \int \Bigg[ \sum_{k=1}^{K} (y_k - p_k)^2 \Bigg] \frac{1}{B(\alpha)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp \\ &= \sum_{k=1}^{K} \mathbb{E} \Big[ y_k^2 - 2 y_k p_k + p_k^2 \Big] \\ &= \sum_{k=1}^{K} \Big( y_k^2 - 2 y_k \mathbb{E}[p_k] + \mathbb{E}[p_k^2] \Big) \end{aligned}$$

where $\mathbb{E}[p_k] = \hat{p}_k = \frac{\alpha_k}{S}$ is the expected probability when sampled from the Dirichlet distribution, and $\mathbb{E}[p_k^2] = \mathbb{E}[p_k]^2 + \text{Var}(p_k)$, where $\text{Var}(p_k) = \frac{\alpha_k (S - \alpha_k)}{S^2 (S + 1)} = \frac{\hat{p}_k (1 - \hat{p}_k)}{S + 1}$ is the variance.

This gives,

\begin{align}
\mathcal{L}(\Theta) &= \sum_{k=1}^{K} \Big( y_k^2 - 2 y_k \mathbb{E}[p_k] + \mathbb{E}[p_k^2] \Big) \\
&= \sum_{k=1}^{K} \Big( y_k^2 - 2 y_k \mathbb{E}[p_k] + \mathbb{E}[p_k]^2 + \text{Var}(p_k) \Big) \\
&= \sum_{k=1}^{K} \Big( \big( y_k - \mathbb{E}[p_k] \big)^2 + \text{Var}(p_k) \Big) \\
&= \sum_{k=1}^{K} \Big( (y_k - \hat{p}_k)^2 + \frac{\hat{p}_k (1 - \hat{p}_k)}{S + 1} \Big)
\end{align}

The first part of the equation, $(y_k - \mathbb{E}[p_k])^2$, is the error term and the second part is the variance.

class SquaredErrorBayesRisk(Module):

evidence is $e \geq 0$ with shape [batch_size, n_classes]
target is $y$ with shape [batch_size, n_classes]

    def forward(self, evidence: torch.Tensor, target: torch.Tensor):

$\alpha_k = e_k + 1$

        alpha = evidence + 1.

$S = \sum_{k=1}^{K} \alpha_k$

        strength = alpha.sum(dim=-1)

$\hat{p}_k = \frac{\alpha_k}{S}$

        p = alpha / strength[:, None]

Error $(y_k - \hat{p}_k)^2$

        err = (target - p) ** 2

Variance $\text{Var}(p_k) = \frac{\hat{p}_k (1 - \hat{p}_k)}{S + 1}$

        var = p * (1 - p) / (strength[:, None] + 1)

Sum of them

        loss = (err + var).sum(dim=-1)

Mean loss over the batch

        return loss.mean()
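To see how the two terms behave, the sketch below returns the error and variance parts separately for a weakly and a strongly supported prediction. The function name and the evidence values are illustrative assumptions, and splitting the terms is done here only for inspection.

```python
import torch


def squared_error_terms(evidence: torch.Tensor, target: torch.Tensor):
    """Hypothetical helper: per-sample error and variance terms of the loss above."""
    alpha = evidence + 1.
    strength = alpha.sum(dim=-1)
    p = alpha / strength[:, None]
    err = ((target - p) ** 2).sum(dim=-1)
    var = (p * (1 - p) / (strength[:, None] + 1)).sum(dim=-1)
    return err, var


target = torch.tensor([[1., 0., 0.]])
# alpha = [10, 1, 1], S = 12
err_weak, var_weak = squared_error_terms(torch.tensor([[9., 0., 0.]]), target)
# alpha = [91, 1, 1], S = 93
err_strong, var_strong = squared_error_terms(torch.tensor([[90., 0., 0.]]), target)
print(err_weak.item(), var_weak.item(), err_strong.item(), var_strong.item())
```

With more evidence for the correct class, both the error term and the variance term shrink.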

KL Divergence Regularization Loss

This tries to shrink the total evidence to zero if the sample cannot be correctly classified.

First we calculate $\tilde{\alpha}_k = y_k + (1 - y_k) \alpha_k$, the Dirichlet parameters after removing the correct evidence.
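For instance, with made-up Dirichlet parameters and a one-hot target for the second class, only the parameter of the true class is reset to $1$ while the misleading ones are kept:

```python
import torch

# Illustrative Dirichlet parameters for one sample with K = 3 classes
alpha = torch.tensor([[3., 5., 2.]])
# One-hot target: the second class is the correct one
target = torch.tensor([[0., 1., 0.]])

# The correct class keeps only the prior count of 1; other classes keep alpha_k
alpha_tilde = target + (1 - target) * alpha
print(alpha_tilde)  # alpha_tilde is [[3., 1., 2.]]
```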

$$KL \Big[ D(p \mid \tilde{\alpha}) \,\Big\Vert\, D(p \mid \langle 1, \dots, 1 \rangle) \Big] = \log \Bigg( \frac{\Gamma \Big( \sum_{k=1}^{K} \tilde{\alpha}_k \Big)}{\Gamma(K) \prod_{k=1}^{K} \Gamma(\tilde{\alpha}_k)} \Bigg) + \sum_{k=1}^{K} (\tilde{\alpha}_k - 1) \Big[ \psi(\tilde{\alpha}_k) - \psi(\tilde{S}) \Big]$$

where $\Gamma(\cdot)$ is the gamma function, $\psi(\cdot)$ is the digamma function and $\tilde{S} = \sum_{k=1}^{K} \tilde{\alpha}_k$.

class KLDivergenceLoss(Module):

evidence is $e \geq 0$ with shape [batch_size, n_classes]
target is $y$ with shape [batch_size, n_classes]

    def forward(self, evidence: torch.Tensor, target: torch.Tensor):

$\alpha_k = e_k + 1$

        alpha = evidence + 1.

Number of classes

        n_classes = evidence.shape[-1]

Remove non-misleading evidence $\tilde{\alpha}_k = y_k + (1 - y_k) \alpha_k$

        alpha_tilde = target + (1 - target) * alpha

$\tilde{S} = \sum_{k=1}^{K} \tilde{\alpha}_k$

        strength_tilde = alpha_tilde.sum(dim=-1)

The first term

\begin{align}
&\log \Bigg( \frac{\Gamma \Big( \sum_{k=1}^{K} \tilde{\alpha}_k \Big)}{\Gamma(K) \prod_{k=1}^{K} \Gamma(\tilde{\alpha}_k)} \Bigg) \\
&= \log \Gamma \Big( \sum_{k=1}^{K} \tilde{\alpha}_k \Big) - \log \Gamma(K) - \sum_{k=1}^{K} \log \Gamma(\tilde{\alpha}_k)
\end{align}

        first = (torch.lgamma(alpha_tilde.sum(dim=-1))
                 - torch.lgamma(alpha_tilde.new_tensor(float(n_classes)))
                 - (torch.lgamma(alpha_tilde)).sum(dim=-1))

The second term $\sum_{k=1}^{K} (\tilde{\alpha}_k - 1) \Big[ \psi(\tilde{\alpha}_k) - \psi(\tilde{S}) \Big]$

        second = (
                (alpha_tilde - 1) *
                (torch.digamma(alpha_tilde) - torch.digamma(strength_tilde)[:, None])
        ).sum(dim=-1)

Sum of the terms

        loss = first + second

Mean loss over the batch

        return loss.mean()
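One way to gain confidence in the closed form is to compare it with the Dirichlet KL divergence that ships with `torch.distributions`. The sketch below does that on made-up inputs. The function name is hypothetical and it returns per-sample values rather than the batch mean.

```python
import torch
from torch.distributions import Dirichlet, kl_divergence


def kl_to_uniform_dirichlet(evidence: torch.Tensor, target: torch.Tensor):
    """Hypothetical helper: per-sample closed-form KL term and alpha_tilde."""
    alpha = evidence + 1.
    n_classes = evidence.shape[-1]
    alpha_tilde = target + (1 - target) * alpha
    strength_tilde = alpha_tilde.sum(dim=-1)
    first = (torch.lgamma(strength_tilde)
             - torch.lgamma(alpha_tilde.new_tensor(float(n_classes)))
             - torch.lgamma(alpha_tilde).sum(dim=-1))
    second = ((alpha_tilde - 1) *
              (torch.digamma(alpha_tilde) - torch.digamma(strength_tilde)[:, None])).sum(dim=-1)
    return first + second, alpha_tilde


# First sample has misleading evidence; second has evidence only for the true class
evidence = torch.tensor([[2., 4., 1.], [0., 7., 0.]])
target = torch.tensor([[0., 1., 0.], [0., 1., 0.]])

closed_form, alpha_tilde = kl_to_uniform_dirichlet(evidence, target)
# Reference value: KL between Dirichlet(alpha_tilde) and the uniform Dirichlet(1, ..., 1)
reference = kl_divergence(Dirichlet(alpha_tilde), Dirichlet(torch.ones_like(alpha_tilde)))
print(closed_form, reference)
```

The second sample has no misleading evidence, so its $\tilde{\alpha}$ is all ones and the penalty is zero.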

Track statistics

This module computes statistics and tracks them with the labml tracker.

class TrackStatistics(Module):

    def forward(self, evidence: torch.Tensor, target: torch.Tensor):

Number of classes

        n_classes = evidence.shape[-1]

Predictions that match the target (greedy selection of the class with the highest probability)

        match = evidence.argmax(dim=-1).eq(target.argmax(dim=-1))

Track accuracy

        tracker.add('accuracy.', match.sum() / match.shape[0])

$\alpha_k = e_k + 1$

        alpha = evidence + 1.

$S = \sum_{k=1}^{K} \alpha_k$

        strength = alpha.sum(dim=-1)

$\hat{p}_k = \frac{\alpha_k}{S}$

        expected_probability = alpha / strength[:, None]

Expected probability of the selected (greedy highest probability) class

        expected_probability, _ = expected_probability.max(dim=-1)

Uncertainty mass $u = \frac{K}{S}$

        uncertainty_mass = n_classes / strength

Track $u$ for correct predictions

        tracker.add('u.succ.', uncertainty_mass.masked_select(match))

Track $u$ for incorrect predictions

        tracker.add('u.fail.', uncertainty_mass.masked_select(~match))

Track $\hat{p}_k$ for correct predictions

        tracker.add('prob.succ.', expected_probability.masked_select(match))

Track $\hat{p}_k$ for incorrect predictions

        tracker.add('prob.fail.', expected_probability.masked_select(~match))
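The same statistics can be computed without the labml tracker, for example to inspect them in a notebook. In the sketch below the function name, the dictionary keys and the inputs are illustrative assumptions. It returns the values instead of logging them.

```python
import torch


def evidential_statistics(evidence: torch.Tensor, target: torch.Tensor) -> dict:
    """Hypothetical helper: the statistics above, returned as a dictionary."""
    n_classes = evidence.shape[-1]
    match = evidence.argmax(dim=-1).eq(target.argmax(dim=-1))
    alpha = evidence + 1.
    strength = alpha.sum(dim=-1)
    # Expected probability of the predicted class
    expected_probability, _ = (alpha / strength[:, None]).max(dim=-1)
    uncertainty_mass = n_classes / strength
    return {
        'accuracy': match.float().mean().item(),
        'u.succ': uncertainty_mass.masked_select(match),
        'u.fail': uncertainty_mass.masked_select(~match),
        'prob.succ': expected_probability.masked_select(match),
        'prob.fail': expected_probability.masked_select(~match),
    }


# Three illustrative samples; the last one is misclassified
evidence = torch.tensor([[9., 0., 0.], [0., 1., 0.], [2., 3., 0.]])
target = torch.tensor([[1., 0., 0.], [0., 1., 0.], [1., 0., 0.]])
stats = evidential_statistics(evidence, target)
print(stats)
```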