Russian Mathematical Surveys, 2024, Volume 79, Issue 6, Pages 991–1015
DOI: https://doi.org/10.4213/rm10208e
(Mi rm10208)
 


Extrapolation of the Bayesian classifier with an unknown support of the two-class mixture distribution

K. S. Lukyanov$^{a,b,c}$, P. A. Yaskov$^{d,e}$, A. I. Perminov$^{a,c}$, A. P. Kovalenko$^{f}$, D. Y. Turdakov$^{a,c}$

a Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia
b Moscow Institute of Physics and Technology (National Research University), Moscow, Russia
c Research Center for Trusted Artificial Intelligence, ISP RAS, Moscow, Russia
d Steklov Mathematical Institute of Russian Academy of Sciences, Moscow, Russia
e National University of Science and Technology MISIS, Moscow, Russia
f Academy of Cryptography of Russian Federation, Moscow, Russia
Abstract: This work introduces a method aimed at enhancing the reliability of the Bayesian classifier. The method involves augmenting the training dataset, which consists of a mixture of distributions from two original classes, with artificially generated observations from a third, ‘background’ class, uniformly distributed over a compact set that contains the unknown support of the original mixture. This modification allows the value of the discriminant function outside the support of the training data distribution to approach a prescribed level (in this case, zero). Adding a decision option for ‘Refusal to Classify’, triggered when the discriminant function takes sufficiently small values, results in a localized increase in classifier reliability. Specifically, this approach addresses several issues: it enables the rejection of data that differs significantly from the training data; facilitates the detection of anomalies in input data; and avoids decision-making in ‘boundary’ regions when separating classes.
The paper provides a theoretical justification for the optimality of the proposed classifier. The practical utility of the method is demonstrated through classification tasks involving images and time series.
Additionally, a methodology for identifying trusted regions is proposed. This methodology can be used to detect anomalous data, cases of parameter shifts in class distributions, and areas of overlap between the distributions of the original classes. Based on these trusted regions, quantitative metrics for classifier reliability and efficiency are introduced.
Bibliography: 23 titles.
Keywords: machine learning, Bayesian classifier, trusted machine learning, interpretability, out-of-distribution (OOD), image classification, time series classification, rejection of classification, background class.
Funding agency: Ministry of Science and Higher Education of the Russian Federation.
Grant numbers: 000000D730321P5Q0002, 70-2021-00142.
This work was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences dated November 2, 2021 No. 70-2021-00142.
Received: 05.09.2024
Published: 20.02.2025
Document Type: Article
UDC: 004.8+519.6
MSC: 62H30, 68T10
Language: English
Original paper language: Russian

1. Introduction

The Bayesian classifier serves as a foundational mathematical model in machine learning (ML) applications [6], providing the theoretical basis for solving classification problems across diverse applied domains.

Widely utilized in contexts such as image recognition, natural language processing, and anomaly detection, the Bayesian classifier relies on probabilistic reasoning to make informed decisions based on observed data. However, a notable limitation of this classifier is its susceptibility to uncertainties outside the training distribution, posing challenges in real-world applications where reliability is paramount.

Practical challenges such as anomalies [13], [8], data drift, out-of-distribution (OOD) data [19], [7], [10], or overlapping data distributions complicate the deployment of existing ML models, particularly concerning their ability to make reliable decisions under uncertainty. A critical issue is the lack of mechanisms enabling models to abstain from making decisions when faced with incomplete or ambiguous data — a shortcoming that undermines the reliability and trustworthiness of machine learning systems.

Data drift, characterized by temporal shifts in data distributions, poses a significant challenge to maintaining model quality and accuracy over time. These shifts can result from changes in user behaviour, environmental conditions, or evolving data sources. Addressing such data requires adaptive strategies to ensure model reliability and stability. To tackle issues related to data drift and out-of-distribution (OOD) scenarios, a variety of modern approaches have emerged, including statistical methods [2], [23], machine learning-based detection models [3], and uncertainty estimation algorithms [21], [12], [14].

However, these methods have notable limitations. Statistical methods are highly sensitive to the choice of criteria. Machine learning-based approaches fail to provide a universal solution, as they often replace one model susceptible to OOD issues by another facing similar challenges. Effective uncertainty estimation techniques rely on the statistical analysis of prediction distributions using methods such as Monte Carlo dropout, which incur high computational costs due to the need for repeated training. This complexity hinders their applicability in many practical tasks. Additionally, heuristic-based methods lack rigorous mathematical justification, making it difficult to determine the boundaries of their applicability.

In recent years regulatory requirements in the field of machine learning (ML) placed significant emphasis on the aspect of trust in modern ML models. As a result, methodologies for evaluating the trustworthiness of ML technologies are currently of paramount importance [11], [20], [16], [17]. When developing reliable ML models, approaches that incorporate the ability to abstain from decision-making represent a substantial paradigm shift. By enabling a model to refuse predictions in uncertain or unfamiliar contexts, risks associated with specific erroneous decisions can be mitigated, thereby improving the overall reliability of systems leveraging ML ensembles. The development of ML model ensembles, each designed to function reliably and cohesively within complex data spaces, holds great promise for addressing real-world challenges and driving innovation across diverse domains.

In this article we propose a modification to the Bayesian classifier aimed at enhancing trust in ML systems. By incorporating a uniform ‘background’ into the training dataset during the learning phase, the decision-making process of the classifier is refined, improving its adaptability to uncertain and OOD scenarios. Experimental results validating the proposed approach are presented for classification tasks involving images and time series.

2. Bayesian classifier extrapolation method

This section outlines the problem statement and the method proposed for modifying classifier training, and provides a theoretical justification for the optimality of the resulting classifier. A geometric interpretation of the neural network is presented, along with a method for constructing an explanatory binary tree. Additionally, a methodology for constructing trusted regions is developed for a classifier based on a fully connected neural network with piecewise linear activation function.

2.1. Classification problem and Bayesian classifier

Let $D = (X, Y)$ be a random vector with a certain distribution $P$, where $X \in [0, 1]^d$ and $Y \in \{\pm 1\}$. Denote the marginal distribution of $X$ corresponding to $P$ by $P_X$. We will refer to values of $X$ as features and values of $Y$ as class labels. The primary goal in a classification problem is to construct a discriminant function $f\colon [0, 1]^d \to \{\pm 1\}$ which maps feature values to the appropriate class labels.

In terms of distributions, the general classification problem can be expressed as

$$ \begin{equation} P(Y \ne f(X)) \to \min_{f}, \end{equation} \tag{1} $$
where $\min$ is taken over all (Borel-measurable) functions with values in $\{\pm 1\}$. Similarly, the regression problem in these terms is formulated as follows:
$$ \begin{equation} E(Y-f(X))^2 \to \min_{f}, \end{equation} \tag{2} $$
where $E$ denotes the expectation concerning $P$, and the minimum is taken over all (Borel-measurable) functions on $[0, 1]^d$.

The solution to the regression problem is the conditional expectation

$$ \begin{equation*} g(x)=E(Y\mid X=x)=2P(Y=1\mid X=x)-1,\qquad x\in[0,1]^d, \end{equation*} \notag $$
as follows from the elementary relation
$$ \begin{equation*} E(Y-f(X))^2=E(Y-g(X))^2+E(g(X)-f(X))^2. \end{equation*} \notag $$
It should be noted that, in general, $g(x)$ is uniquely defined on $[0, 1]^d$ only $P_X$-almost surely. Specifically, outside the support $ \mathbb{S}$ of the distribution of the vector $X$ in $[0, 1]^d$ the conditional expectation function $g(x)$ can take arbitrary values.

The solution to the classification problem is the Bayesian classifier

$$ \begin{equation} s(x)=\begin{cases} 1&\text{if}\ g(x)>0\ \text{and}\ x\in\mathbb S, \\ \text{any of}\ \pm1&\text{if}\ g(x)=0\ \text{or}\ x\notin\mathbb S, \\ -1&\text{if}\ g(x)<0\ \text{and}\ x\in\mathbb S, \end{cases} \end{equation} \tag{3} $$
corresponding to $g$ (the given version of the conditional expectation). This follows from the fact that for any $f$ with values in $\{\pm1\}$ the following holds:
$$ \begin{equation*} 4P(Y \ne f(X))=E(Y-f(X))^2=E(Y-g(X))^2+E(g(X)-f(X))^2. \end{equation*} \notag $$
From (3), the uncertainty region of the Bayesian classifier corresponding to $g$ is given by the set $[0,1]^d\setminus \mathbb S\cup \{x\colon g(x)=0\}$.

In practice, the distribution $P$ is generally unknown. However, it is typically assumed that a sample from $P$ is available. To approximate the Bayesian classifier, empirical analogues of the optimization problems (1) and (2) are used, incorporating regularization and various constraints on the classes of functions $f$ over which optimization is performed.
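As a simple illustration of the plug-in approach just described (our own sketch, not taken from the paper; the synthetic data and the feature map are assumptions made for the example), one can fit a regression estimate of $g(x)=E(Y\mid X=x)$ by least squares over a small class of functions and classify by the sign of the estimate, as in (3):

```python
# Minimal sketch (illustrative only): a plug-in approximation of the Bayesian
# classifier. A regression estimate of g(x) = E(Y | X = x) is fitted by least
# squares over a small class of functions; the classifier is the sign of the fit.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class sample on [0, 1]^2 with labels in {-1, +1}.
n = 2000
X = rng.uniform(size=(n, 2))
p_plus = 1.0 / (1.0 + np.exp(-8.0 * (X.sum(axis=1) - 1.0)))  # P(Y = 1 | X = x)
Y = np.where(rng.uniform(size=n) < p_plus, 1.0, -1.0)

def features(X):
    # Degree-2 polynomial feature map: a crude stand-in for the class of
    # functions over which the empirical analogue of (2) is minimized.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

w, *_ = np.linalg.lstsq(features(X), Y, rcond=None)  # empirical analogue of (2)
g_hat = features(X) @ w                               # estimate of g(x)
s_hat = np.where(g_hat > 0, 1.0, -1.0)                # plug-in Bayes classifier
print("empirical error:", np.mean(s_hat != Y))
```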

2.2. Problem statement

A fundamental challenge in machine learning lies in the fact that both training and subsequent inference are justified only within the support of the data distribution. As discussed earlier, the region outside the support $\mathbb{S}$ of the random vector $X$ falls into the uncertainty zone of the Bayes classifier. Despite this, machine learning methods typically do not evaluate the boundaries of $\mathbb{S}$ and provide instead specific classification rules over the entire compact set $[0, 1]^d$, including outside $\mathbb{S}$. In cases of shifts or distortions in the data distribution (both in the training and test sets), conclusions made outside $\mathbb{S}$ can be erroneous. In such cases it would naturally be prudent to abstain from making a decision. However, as mentioned in the introduction, popular methods often lack automatic mechanisms for abstaining, which poses practical challenges.

Let us examine in more detail when it is appropriate to abstain from making a decision:

1. Outliers. A point is located far from the rest, and the model has no information on how to classify it. To address this, data preprocessing is often applied, using specialized methods for detecting outliers. However, these methods frequently involve hyperparameters and remove data before model training rather than during the process.

2. Out-of-Distribution (OOD) Problem. If the data distribution changes, the model cannot make accurate predictions. Solutions typically fall into three categories: statistical methods, shift modeling, and detection using machine learning techniques. Statistical methods are numerous, but detection results can be highly sensitive to the choice of the method and hyperparameters. Shift modeling has the drawback that some variations can remain overlooked, as it requires assumptions about the prior data distribution and its temporal behaviour, which are challenging to automate. For machine learning-based detection models an OOD problem can re-emerge for the detector itself.

3. Overlapping Class Distributions. In overlapping regions it is impossible to assign a definitive label to a new point since the probabilities for each class can be nearly identical. In practice, it is better to refer such cases to a human expert, who can use additional information to decide in the specific context.

Abstaining from making a decision is justified in the uncertainty zone of the Bayes classifier, that is, when observations fall in the set $[0, 1]^d \setminus \mathbb{S} \cup \{x\colon g(x) = 0\}$ (see § 2.1). The primary issue is that the support $\mathbb{S}$ is in general unknown. The following section demonstrates how this problem can be addressed without estimating $\mathbb{S}$ directly.

2.3. Modification of Bayesian classifier

Our approach to modifying the Bayes classifier is to extrapolate it using noise distributed uniformly on the compact domain: a portion of randomly generated data, with features uniformly distributed on the compact domain and labels equal to zero, is added to the original data.
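A minimal sketch of this augmentation step is given below (our own illustration; the function name and the way the noise share is chosen are assumptions, not the authors' published code):

```python
# Minimal sketch (illustrative only): augmenting a two-class sample with
# 'background' points uniformly distributed on [0, 1]^d and labelled 0,
# so that the noise makes up a share alpha of the resulting mixture.
import numpy as np

def augment_with_background(X, Y, alpha, rng):
    """X: (n, d) features in [0, 1]^d; Y: labels in {-1, +1}; 0 < alpha < 1."""
    n, d = X.shape
    m = int(np.ceil(alpha / (1.0 - alpha) * n))  # number of background points
    X_noise = rng.uniform(size=(m, d))           # uniform on the compact set [0, 1]^d
    Y_noise = np.zeros(m)                        # 'background' class label
    X_mix = np.vstack([X, X_noise])
    Y_mix = np.concatenate([Y, Y_noise])
    perm = rng.permutation(len(Y_mix))
    return X_mix[perm], Y_mix[perm]

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
Y = np.where(X[:, 0] > X[:, 1], 1.0, -1.0)
X_mix, Y_mix = augment_with_background(X, Y, alpha=0.5, rng=rng)
```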

After such a modification the distribution of the data vector $(X, Y)$ in $[0, 1]^d \times \{\pm 1, 0\}$ will represent a mixture of two distributions:

$$ \begin{equation*} P_\alpha=(1-\alpha)P+\alpha \mathcal P, \end{equation*} \notag $$
where $\alpha \in (0, 1)$, $P$ is the original data distribution, and $\mathcal{P}$ is the distribution of a random vector, where the last component is 0, and the first $d$ components form a vector uniformly distributed over $[0, 1]^d$. By definition, the distribution of $X$ under $P_\alpha$ will have the form:
$$ \begin{equation*} \lambda_\alpha=(1-\alpha)P_X+\alpha \lambda, \end{equation*} \notag $$
where $\lambda$ is the Lebesgue measure on $[0, 1]^d$, and $P_X$ is the distribution of $X$ when the vector $(X, Y)$ follows $P$. Let $E_\alpha$ denote the expectation with respect to $P_\alpha$ and $\mathbb{S}$ be the support of the distribution $P_X$.

By the Lebesgue decomposition and the Radon–Nikodym theorem, there exist a non-negative integrable function $\rho$ on $[0, 1]^d$ and a Borel set $A \subseteq \mathbb{S}$ of Lebesgue measure zero such that:

$$ \begin{equation*} P_X(B)=\int_B\rho(x)\,dx+P_X(A\cap B) \end{equation*} \notag $$
for all Borel sets $B$ in $[0, 1]^d$.

Theorem. For every $\alpha \in (0, 1)$ the solution $g_\alpha$ to the regression problem

$$ \begin{equation} E_\alpha(Y-f(X))^2 \to \min_{f} \end{equation} \tag{4} $$
exists, is unique $P_X$- and $\lambda$-almost surely, and can be expressed as
$$ \begin{equation} g_\alpha(x)=\begin{cases} g(x)& \textit{if}\ x\in A, \\ \dfrac{(1-\alpha)g(x)\rho(x)}{\alpha+(1-\alpha)\rho(x)}& \textit{if}\ \rho(x)>0\ \textit{and}\ x\in \mathbb S\setminus A, \\ 0& \textit{if either}\ \rho(x)=0\ \textit{and}\ x\in \mathbb S\setminus A,\ \textit{or}\ x\notin\mathbb S, \end{cases} \end{equation} \tag{5} $$
where $\min$ is taken over all (Borel) functions $f$ and
$$ \begin{equation*} g(x)=E(Y\mid X=x)\quad\textit{on}\ \ [0,1]^d. \end{equation*} \notag $$

The classifier $s_\alpha = s_\alpha(x)$, for $x \in [0, 1]^d$, defined by formula (3) with $g$ replaced by any solution $g_\alpha$ of problem (4) and $\mathbb{S}$ replaced by $[0, 1]^d$, possesses the following properties:

(i) $s_\alpha$ solves the classification problem:

$$ \begin{equation*} P(Y \ne f(X)) \to \min_{f}, \end{equation*} \notag $$
where $\min$ is taken over all (Borel) functions with values $\pm 1$;

(ii) the uncertainty region of $s_\alpha$ is the set $\{x \in [0, 1]^d\colon g_\alpha(x) = 0\}$, which covers, $\lambda$-almost everywhere, the set $[0, 1]^d \setminus \mathbb{S}$, where $\mathbb{S}$ is the support of the distribution $P_X$.

Proof. As shown in § 2.1, the solution to problem (4) exists and coincides with the conditional expectation
$$ \begin{equation*} g_\alpha(x)=E_\alpha(Y\mid X=x),\qquad x\in[0,1]^d, \end{equation*} \notag $$
which is uniquely defined almost surely with respect to $\lambda_\alpha$, that is, it is simultaneously uniquely defined almost surely with respect to $P_X$ and $\lambda$.

To derive an explicit formula for $g_\alpha$, observe the following. First, $E_\alpha g_\alpha^2(X) < \infty$ because the values of $Y$ are bounded. Therefore, in (4) we can exclude all (Borel) functions $f$ such that $E_\alpha f^2(X) = \infty$, and treat the last integral as finite. Then, by the definition of $P_\alpha$,

$$ \begin{equation*} \begin{aligned} \, E_\alpha(Y-f(X))^2&=(1-\alpha)E(Y-f(X))^2+\alpha\int_{[0,1]^d}f^2(x)\,dx \\ &=(1-\alpha)E(Y-g(X))^2+L(f), \end{aligned} \end{equation*} \notag $$
where
$$ \begin{equation*} L(f)=(1-\alpha)E(g(X)-f(X))^2+\alpha\int_{[0,1]^d}f^2(x)\,dx. \end{equation*} \notag $$
We rewrite $L(f)$ as
$$ \begin{equation*} \begin{aligned} \, L(f)&=(1-\alpha)\int_A(g(x)-f(x))^2\,P_X(dx) \\ &\qquad+\int_{\{x\in \mathbb S\setminus A\colon\rho(x)>0\}}\bigl[(1-\alpha) (g(x)-f(x))^2\rho(x)+\alpha f^2(x)\bigr]\,dx \\ &\qquad+\alpha\int_{[0,1]^d\setminus \mathbb S\cup \{x\in\mathbb S\setminus A\colon\rho(x)=0\}}f^2(x)\,dx. \end{aligned} \end{equation*} \notag $$
Since the minimum of the expression
$$ \begin{equation*} (1-\alpha)(a-z)^2b+\alpha z^2 \end{equation*} \notag $$
for $b > 0$ is achieved at the point
$$ \begin{equation*} z_*=\frac{(1-\alpha)ab}{\alpha+(1-\alpha)b}\,, \end{equation*} \notag $$
the minimum of $L(f)$ is obviously achieved at $g_\alpha$ given by (5). Rewriting the expression above, we obtain:
$$ \begin{equation*} (1-\alpha)(a-z)^2b+\alpha z^2=(\alpha+(1-\alpha)b)(z-z_*)^2+ \frac{\alpha(1-\alpha)a^2b}{\alpha+(1-\alpha)b}\,. \end{equation*} \notag $$
Therefore, up to the term $R_\alpha$, which does not depend on $f$, for $g_\alpha$ defined by (5) the following relation holds:
$$ \begin{equation*} L(f)=(1-\alpha)\|g_\alpha-f\|_{P_X}^2+ \alpha\|g_\alpha-f\|_\lambda^2+R_\alpha, \end{equation*} \notag $$
where $\|\,{\cdot}\,\|_{\mu}$ is the $L_2$-norm with respect to the measure $\mu = P_X$ or $\lambda$, and we note that $g_\alpha$ equals zero outside $\mathbb{S}$ or when $\rho(x) = 0$, and $g_\alpha = g$ on $A$. Since all solutions to problem (4) are uniquely defined almost surely with respect to $P_X$ and $\lambda$, this relation with $L(f)$ also holds for any other solution $g_\alpha$, not necessarily defined by (5).

Let $s_\alpha$ be the classifier from the theorem. To verify (i) it is sufficient to show that $P(s(X) = s_\alpha(X)) = 1$, where $s$ is the Bayes classifier from (3) for $g(x)=E(Y\mid X=x)$ on $[0,1]^d$. This is equivalent to showing that

$$ \begin{equation*} P\bigl(g(X)g_\alpha(X)>0\ \text{or}\ g(X)=g_\alpha(X)=0\bigr)=1. \end{equation*} \notag $$
Since $g_\alpha$ is uniquely defined almost surely with respect to $P_X$ and can be expressed by formula (5), we can assume that $g_\alpha$ is everywhere given by this formula. It remains to observe that the condition
$$ \begin{equation*} g(x)g_\alpha(x)>0\quad\text{or}\quad g(x)=g_\alpha(x)=0 \end{equation*} \notag $$
is satisfied by definition when either $x \in A$, or $\rho(x) > 0$ and $x \in \mathbb{S} \setminus A$, while the remaining cases have zero probability:
$$ \begin{equation*} P(\rho(X)=0,X\in\mathbb S\setminus A)=P(X\not\in\mathbb S)=0. \end{equation*} \notag $$
Thus, the proof of (i) is complete.

Property (ii) follows from (5) and the fact that $g_\alpha$ is uniquely defined almost surely with respect to $\lambda$. $\Box$

Remark. If instead of (4) we consider a problem of the form

$$ \begin{equation} E_\alpha(Y-f(X))^2+\operatorname{Pen}(f)\to \min_{f\in \mathcal F}, \end{equation} \tag{6} $$
where $\mathcal{F}$ is some parametric family of functions (for example, neural networks with a specified architecture), and $\operatorname{Pen}(f)$ is a regularization penalty term, then, as follows from the proof of the theorem above, in terms of minimizing functions in (6) the problem is equivalent to the problem of approximating $g_\alpha$ — any solution to (4) — by a function from the class $\mathcal{F}$, taking the penalty $\operatorname{Pen}(f)$ into account:
$$ \begin{equation} (1-\alpha)\|g_\alpha-f\|_{P_X}^2+\alpha\|g_\alpha-f\|_{\lambda}^2+ \operatorname{Pen}(f)\to \min_{f\in \mathcal F}, \end{equation} \tag{7} $$
where $\|\,{\cdot}\,\|_{\mu}$ is the $L_2$-norm with respect to the measure $P_X$ or $\lambda$.

As noted in § 2.1, in practice, the distribution of data $P$ is unknown, but a sample from $P$ is available, and the empirical analogue of (6) is solved by using machine learning algorithms. In this case, the uncertainty region for the classifier in question, where a decision is not made, should naturally be increased by introducing an additional hyperparameter $\beta$. Specifically, a decision should be withheld when $|\mathbf{f}| \leqslant \beta$, where $\mathbf{f}$ is an estimate for the optimal solution to the empirical analogue of (6).

The practical meaning of the parameter $\beta$ is the minimum level of confidence for the classifier to make a decision. The limiting case $\beta = 0$ corresponds to the case when there is an infinite amount of data (that is, the sample size is infinite), effectively when the distribution $P$ is known.
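For illustration, the resulting decision rule with rejection can be written down directly (a minimal sketch; the names are ours):

```python
# Minimal sketch: classification with refusal. The trained estimate f of
# g_alpha decides by its sign and abstains (returns 0) when |f(x)| <= beta.
import numpy as np

def classify_with_rejection(f_values, beta):
    """f_values: model outputs f(x); returns +1, -1, or 0 (refusal to classify)."""
    decisions = np.sign(f_values)
    decisions[np.abs(f_values) <= beta] = 0.0
    return decisions

outputs = np.array([0.8, -0.02, 0.25, -0.6])
print(classify_with_rejection(outputs, beta=0.1))  # [ 1.  0.  1. -1.]
```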

2.4. Geometric interpretation of neural network

To use the method not only for identifying trusted and untrusted areas but also for determining precisely the cause of distrust in an area and/or classifying test points as anomalies, intersections of distributions, and so on, it is necessary to be able to evaluate individual regions of the space. But first it is important to understand how the neural network divides the space into regions, and here the geometric interpretation of the neural network is helpful.

The geometric interpretation of the first layer of the neural network

For simplicity, let us consider an input space of dimension $d = 2$, that is, points in the plane. In this case each neuron of the first layer defines a separating line (hyperplane), to which an activation function is applied:

$$ \begin{equation*} \operatorname{neuron}=\sigma(w_{1} \cdot x_{1}+w_{2} \cdot x_{2}+b), \end{equation*} \notag $$
where $\sigma$ is any piecewise linear activation function, such as $\operatorname{ReLU}$ or $\operatorname{Abs}$.

Since the activation function is piecewise linear, the function realized by the network as a whole is continuous and piecewise linear.

The lines formed by the first layer (Fig. 1) divide the input data space into cells. The analytic form of these cells can be obtained by solving systems of linear inequalities derived from the equations of the lines and by checking the results for positivity or negativity.

By considering all possible inequality signs, one can obtain systems of linear inequalities for all the cells formed. Such systems do not always have a solution, which is why the number of systems of inequalities corresponding to the cells will be significantly smaller than $2^k$, where $k$ is the number of neurons in the layer. Specifically, as noted in [4],

$$ \begin{equation*} N_{\rm cells}=\begin{cases} 2^k, & k \leqslant d, \\ \displaystyle\sum_{i=0}^d \binom{k}{i},& k > d. \end{cases} \end{equation*} \notag $$

Thus, the first layer of the network performs a partition of the input space into cells in which the input points are located.
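The bound above is easy to evaluate numerically; a short sketch (ours) follows.

```python
# Minimal sketch: the number of cells cut out of the d-dimensional input space
# by k hyperplanes in general position, following the formula above.
from math import comb

def n_cells(k, d):
    return 2**k if k <= d else sum(comb(k, i) for i in range(d + 1))

print(n_cells(2, 2))   # 4 = 2^2
print(n_cells(10, 2))  # 56 = C(10,0) + C(10,1) + C(10,2)
```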

The geometric interpretation of subsequent layers

The next layer of the neural network similarly partitions each cell formed by the first layer. As a result, new cells form within the cells already established, creating a hierarchical partition of the original compact domain. Each network layer divides the cells of the previous layer into smaller fragments (cells). The final layer, which implements the sign activation function, completes the partitioning process and makes the final decision on which class a point landing in the final cell should belong to. Thus, the sequence of signs of the linear inequalities generated by each layer of the network ultimately forms a unique index of the cell in which a point lands after classification.
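A minimal sketch (our own illustration, with assumed variable names) of how the sequence of inequality signs yields a cell index for a fully connected ReLU network:

```python
# Minimal sketch: the sequence of signs of the pre-activations of all neurons,
# layer by layer, serves as an index of the cell into which an input point falls.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def cell_index(x, layers):
    """layers: list of (W, b) pairs of a fully connected ReLU network.
    Returns a tuple of 0/1 signs of all pre-activations: the index of the cell containing x."""
    signs = []
    h = x
    for W, b in layers:
        z = W @ h + b
        signs.extend((z > 0).astype(int).tolist())
        h = relu(z)
    return tuple(signs)

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(10, 2)), rng.normal(size=10)),
          (rng.normal(size=(10, 10)), rng.normal(size=10))]
print(cell_index(np.array([0.3, 0.7]), layers))
```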

2.5. Construction of an explanatory binary tree

The process of forming the explanatory binary tree (eXBTree) follows directly from the scheme of cell formation in the first layer and the subsequent division of cells by the following layers of the network. The root of the tree is the hyperplane formed by the first neuron of the first layer. The root node includes all points from the original compact domain. Points that lie ‘below/to the left of’ the hyperplane go to the left-hand node, while the rest go to the right-hand node. The second neuron of the first layer partitions each of the nodes obtained previously, and so on, until all neurons of the first layer are considered. As a result, the compact domain is divided into cells by the first layer of the network. Then each cell of the first layer is subdivided by the neurons of the second layer, and so on, layer by layer, until all neurons in all layers, including the final output layer, have been considered. The result is a complete explanatory binary tree.

It is not necessary to complete the process of forming the whole tree, as, at some point, a node can contain points that belong to only one class (Fig. 2). There is no need to divide such a cell further, as it will not affect the final decision.

The most important property of the tree formed is its good interpretability from a geometric perspective. It is easy to trace the path of an arbitrary input example through the tree, and much more difficult to do this in a fully connected network, where ‘signals’ propagate to all neurons independently. Moreover, if the number of points is sufficiently large and the number of cells is small, the cells formed in the last layer can be considered as a histogram of the original distribution, and the frequencies of elements from different classes can be analyzed to possibly make a different decision than the network/tree did. If a cell contains roughly an equal number of elements from both classes in the training sample, then it is likely not advisable to trust the decision of the network.

However, most networks tend to have a large number of neurons, resulting in a large number of cells, many of which are empty or contain only a single training example. In such cases discussing frequencies becomes meaningless, but information about ‘nearby’ examples can be used. Consider the following scenario: a test sample is input, and it lands in an empty cell. The neural network will still produce a result — a number whose sign will determine the classification. To confirm or refute this result the labels of the training examples located ‘close’ to the tested sample can be analyzed. To do this one could, for instance, examine the training examples belonging to the cell of the previous layer of the explanatory tree. If there are not enough examples there, one can move up to higher layers. In this way the classification result can be explained by analyzing precedents. It should be noted, however, that the proposed approach to searching for precedent ‘neighbours’ does not fully correspond to the geometric proximity of points in the original space and depends on the results of the network’s training.
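As an illustration of the precedent search just described (a sketch under our own assumptions about the data structures involved, not the authors' implementation): training points are compared with the test point by the length of the common prefix of their cell indices, and one moves up the tree until enough 'neighbours' are collected.

```python
# Minimal sketch: precedent 'neighbours' of a test point are the training points
# whose activation-pattern prefix agrees with it on as many leading neurons as
# possible; if the current cell has too few points, we climb to the parent cell.
import numpy as np

def precedent_neighbours(test_pattern, train_patterns, train_labels, min_count=10):
    """Patterns are tuples of 0/1 signs produced layer by layer."""
    depth = len(test_pattern)
    while depth > 0:
        mask = [p[:depth] == test_pattern[:depth] for p in train_patterns]
        idx = np.flatnonzero(mask)
        if len(idx) >= min_count:
            return train_labels[idx]   # labels of precedents in the (ancestor) cell
        depth -= 1                     # move up the tree: a coarser cell
    return train_labels                # root: the whole training set

train_patterns = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1)]
train_labels = np.array([1, -1, 1, -1])
print(precedent_neighbours((1, 1, 0), train_patterns, train_labels, min_count=2))  # [1 1]
```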

2.6. Trusted area allocation technique using a tree

The methodology will be presented for the case where the parametric family of decision functions consists of fully connected neural networks with fixed architecture, using piecewise linear activation functions. The output layer of the network contains a single neuron, the sign of whose value determines the membership to one of the two classes.

$\bullet$ Trusted Area Allocation Methodology.

$\bullet$ Extension to solve practical problems.

2.7. Time and space complexity of eXBTree

To determine the class membership of a point, one must traverse the tree from the root to the leaf. In the worst case, the depth of the tree corresponds to the number of neurons, so the time complexity is $O(N)$, where $N$ is the number of neurons in the entire network. In the best case, if the input data is linearly separable, the classifier consists of only one node.

The space complexity in the simplest case coincides with that of the neural network itself: each node of the tree stores the equations of the separating lines in the form of weight coefficients. Owing to the use of analytic expressions, this can be reduced in specific cases, but in general it remains $O(M)$, where $M$ is the total number of weight coefficients in the neural network.

3. Experiments

This section presents the results of experiments conducted for time series and image classification tasks. All main experiments and visualizations were performed using a tool created by the authors, the ‘Dense Network Visualizer’. More detailed information about the tool’s functionality is provided in Appendix B.

All experiments were conducted on a desktop PC with 32 GB of DDR4 RAM and an Intel Core i7-11700 processor.

3.1. Image classification

In the experiments two well-known image classification datasets were used: MNIST and CIFAR-10. The MNIST dataset consists of 70 000 grayscale images of handwritten digits (0–9), while CIFAR-10 contains 60 000 colour images in ten different classes.

A binary classification approach was implemented for the training procedure. Each class was trained separately (one against the others), with the target class labelled as 1 and all other classes as $-1$. This approach differs from the typical use of labels 1 and 0. It is important to note that the output layer of the models did not use an activation function. This architectural choice allowed for direct classification based on the sign of the output value.

Initially, the models were trained using standard training data, and their performance was evaluated using common metrics such as accuracy and the F1 score. Subsequently, an additional step was introduced into the training process: at each epoch noise generated uniformly in the compact set $[0, 1]^d$ was added. The model was trained to predict the label 0 for noise points. This augmentation of the training set was aimed at teaching the model to distinguish between valid input data and noise.

To evaluate models trained on this extended dataset, a rejection interval $[-0.3, 0.3]$ was used (with hyperparameter $\beta = 0.3$, as described in § 2.3). If the output value of the model fell within this interval, the prediction was considered unreliable and was rejected. This modification enhanced the reliability of the model’s predictions.
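A minimal sketch of this evaluation procedure (our own, with assumed metric definitions: accuracy is computed on the non-rejected points, and the rejection rate is the share of rejected ones):

```python
# Minimal sketch: evaluating a model trained with the background class.
# Predictions with |output| <= beta are rejected; accuracy is reported on the
# remaining points together with the rejection rate.
import numpy as np

def accuracy_with_rejection(outputs, labels, beta=0.3):
    """outputs: raw model outputs (no output activation); labels: +1/-1."""
    rejected = np.abs(outputs) <= beta
    accepted = ~rejected
    rejection_rate = rejected.mean()
    if accepted.any():
        acc = (np.sign(outputs[accepted]) == labels[accepted]).mean()
    else:
        acc = float("nan")
    return acc, rejection_rate

outputs = np.array([0.9, -0.1, 0.5, -0.7, 0.2])
labels = np.array([1, 1, -1, -1, 1])
print(accuracy_with_rejection(outputs, labels, beta=0.3))  # (0.666..., 0.4)
```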

The results showed that models trained on the extended training dataset achieved higher performance metrics (Fig. 3 and Table 1) compared to the standard approach. Some data points fell in the unreliable prediction region, which reduced the overall number of reliable predictions. A visual inspection of the input data that was rejected for classification subjectively confirms the ‘uncertain’ decision made by the modified classifier (Fig. 4).

Table 1. Classification performance on MNIST with additional noise (average over 30 runs)

Training type      | Accuracy         | Rejection rate
standard training  | $97.2\pm 0.8\%$  | 0%
with 100% noise    | $98.8\pm 0.3\%$  | $4.8\pm 2.4\%$
with 1000% noise   | $97.9\pm 1.2\%$  | $8.5\pm 1.6\%$

3.2. Classification of time series

The task of predicting hard disc failures was addressed using the Backblaze Drive dataset [9]. This dataset has been collected and used for training for many years: it contains records of hard drive operation broken down by year and by day within each year, covers different manufacturers, and comprises over 100 000 records per day. It also includes all SMART indicators of the discs, which form a multivariate time series for each year.

For the experiments, 20 features (SMART indicators) were selected, forming an extended feature space, and a subset of 7 features from these 20 was chosen for experiments with a reduced feature space. The selection was based on the most frequently used feature combinations in similar works [15], [1], [22]. The SMART indicators selected include:

$$ \begin{equation*} 1,\ 3,\ 4,\ 5,\ 7,\ 9,\ 10,\ 12,\ 187,\ 188,\ 190,\ 192,\ 193,\ 194,\ 197,\ 198,\ 199,\ 240,\ 241,\ 242. \end{equation*} \notag $$
Detailed descriptions of the selected SMART indicators and the data preprocessing procedures for the experiments are available in Appendix C. The final class balance after preprocessing was 56% to 44%.

The code for data preparation and running a series of experiments for the time series classification task, along with the data prepared, is published on GitLab: https://gitlab.com/LukianovKirill/trustai_72.

The classification model used was a fully connected neural network. The loss function was MSE; the optimizer was Adam (with default optimizer hyperparameters); the learning rate was $0.001$; the regularization was L2 with a regularization coefficient of $0.001$; the batch size was 16; and the number of epochs was 1200 for the case of 7 features and 600 for the case of 20 features. The input layer size was 7 and 20, respectively, with two hidden layers, each containing 10 neurons plus a bias, and the activation function was $\operatorname{ReLU}$. The output layer size was 1, and classification was performed based on the sign of the output or rejection of the decision if the absolute value of the output was smaller than $\beta$. For experiments with noise the number of noise points was equal to the number of real data points in the prepared datasets, and the decision rejection threshold was set to $\beta = 0.1$.
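A minimal PyTorch sketch of the described configuration (an assumption about one possible realization; the code published on GitLab is authoritative):

```python
# Minimal sketch: fully connected network with 7 (or 20) inputs, two hidden
# layers of 10 ReLU neurons, a single linear output (the sign decides the class),
# MSE loss, and Adam with learning rate 0.001 and L2 regularization via weight_decay.
import torch
import torch.nn as nn

def make_model(n_features):
    return nn.Sequential(
        nn.Linear(n_features, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1),   # no output activation
    )

model = make_model(7)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)

# One training step on a dummy batch of size 16; labels lie in {-1, 0, +1},
# where 0 marks the added background (noise) points.
x = torch.rand(16, 7)
y = torch.randint(-1, 2, (16, 1)).float()
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```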

Although many studies have been conducted based on the dataset in question, comparing the results is challenging due to the use of different subsets of the dataset, data from different years, varying class distributions, and different model types. For the proposed method, experiments were conducted on a fully connected network with a fixed set of features. The metrics for the trained model in the case of 7 features are as follows (averaged over 20 runs, the primary class being the failure class):

$$ \begin{equation*} \text{accuracy:}\ 78.8\pm 3.1\%,\qquad \text{F1:}\ 75.4\pm 4.3\%,\quad\text{and}\quad \text{AUROC:}\ 77.6\pm 3.9\%. \end{equation*} \notag $$
Previous works computed accuracy for imbalanced classes, which led to an overestimation of performance, so comparing the results obtained with those works is not appropriate. This issue was also noted in [15], where the authors used subsets of the larger class. However, these subsets are not publicly available, making a full comparison impossible. Their neural network achieves an AUROC of $68.4 \pm 6.5\%$. Thus, the model obtained in this study is comparable in quality to similar models described in the literature.

Table 2 presents the results after adding noise. Quality improved in both scenarios, with a significant portion of erroneous predictions replaced by rejections. Although this training approach does not yield a substantial improvement in accuracy, prediction stability is maintained once rejection is enabled.

Table 2. Hard disc failure classification performance with additional noise (averaged over 20 runs)

Training type                   | Accuracy         | Rejection rate
standard training, 7 features   | $78.8\pm 3.1\%$  | 0%
with noise, 7 features          | $82.2\pm 2.4\%$  | $7.7\pm 2.3\%$
standard training, 20 features  | $77.6\pm 3.3\%$  | 0%
with noise, 20 features         | $81.2\pm 2.8\%$  | $8.5\pm 2.4\%$

Therefore, after training, the methodology outlined in the previous section can be applied to identify trusted areas and address the issue of anomaly detection and outliers in current HDD failure prediction tasks [18], [5], confirming the practical significance of the proposed classification method.

4. Conclusion

The paper presents a method for enhancing the reliability and robustness of a Bayesian classifier by incorporating uniform noise into the training dataset. A theoretical proof of the optimality of the proposed extrapolation method is provided. This approach allows the classifier to make decisions to reject classification in regions where the training data points are underrepresented. The implementation of the method as a fully connected neural network model with piecewise-linear activation functions offers an effective solution to the problems of OOD detection, outlier detection, and constructing class distribution intersection areas. Experiments with image classification and time-series tasks confirmed the practical advantages of this method, demonstrating its ability to improve decision boundaries and facilitate rejection in ambiguous situations.

In general, the idea of modifying the training dataset by adding a uniformly distributed background (noise) improves the local reliability of a Bayesian classifier under constraints on the amount of training data. The features of the neural network model allow for the determination of trust regions as cells defined by systems of linear inequalities, within which the error probability does not exceed a specified level. Based on these regions, the effectiveness of the classifier can be assessed quantitatively as the ratio of the number of observations that fall within trust regions to the total number of observations. Further improvements in classification efficiency can be achieved by forming ensembles of classifiers that operate in different initial feature spaces.

Appendix A. Demonstration of the application of the methodology using a simple example

In this appendix the steps of the methodology are demonstrated using a simple two-dimensional example.

Consider the standard simple data distribution in the form of two spirals for the two-dimensional case, as shown in Fig. 5.

A fully connected neural network was used as the classification model. Loss function: MSE; optimizer: Adam (default optimizer hyperparameters); learning rate: $0.001$; regularization: L2; regularization coefficient: $0.001$; batch size: 16; number of epochs: 1 500 for the noise case, 600 for the noise-free case. Input layer size: 2; two hidden layers, each with 10 neurons plus bias; activation function: $\operatorname{ReLU}$. Output layer size: 1; classification was performed based on the sign of the output or rejection of the decision if the absolute value of the output was less than $\beta$. For experiments with noise the number of noise points was equal to the number of real data points in the prepared datasets, and the decision rejection threshold was set to $\beta = 0.1$.

Without the addition of noise that would annul the approximating function outside the data, the resulting model is shown in Fig. 6; the accuracy is 96.8%. It can be seen that outside the spirals the function behaves arbitrarily (in this case, within a compact region). For example, if the spirals are extended further, the model error increases sharply. Similarly, if there is a shift in the data, the error also increases significantly.

When noise is added and the rejection threshold is set to $\beta = 0.1$, the approximating function is practically nullified outside the spirals, and according to the threshold, the entire external area becomes untrusted. The resulting model is shown in Fig. 7; the accuracy is 97.8%. Distribution shifts, outliers, and other practical issues can therefore be detected promptly.

If an eXBTree is constructed and the cells are sorted by the number of data points in them, then of the 259 tree leaves, more than 85% of all data will be concentrated in the 32 largest leaves. Almost all cells with a large amount of data will contain between 10 and 60 points for one class and 0 points for the other class. The value of $R_{\rm leaf}$ will range from 0.3 to 0.9. In Fig. 8 the two largest trusted cells for each class are highlighted.

Appendix B. DenseNetworkVisualizer

DenseNetworkVisualizer is an advanced system based on JavaScript, developed to manage and visualize multilayer perceptron (MLP) models and related data. DenseNetworkVisualizer is available at: https://dronperminov.ru/machine-learning/dense-network-visualizer. The system provides the following basic functions.

$\bullet$ Configuration of model parameters.

$\bullet$ Data management.

$\bullet$ Setting up training.

$\bullet$ Visualization and analysis.

These features collectively enhance the capabilities for managing, training, and analyzing multilayer perceptron models, making DenseNetworkVisualizer a reliable tool for both research and practical applications in the field of machine learning. Examples of the framework’s use can be seen in Fig. 9.

License for DenseNetworkVisualizer

DenseNetworkVisualizer, used to create illustrations in this paper, is provided under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This means that the service can freely be used for any purpose, provided that appropriate credit is given to the original author.

For academic and non-commercial use, please cite this paper.

Access to the service is available at: https://dronperminov.ru/machine-learning/dense-network-visualizer, and the full text of the license can be found at: https://creativecommons.org/licenses/by/4.0/.

Appendix C. HDD data preprocessing

The classification task used data from 2020, covering 162 299 discs. All SMART attributes were distributed within one of three possible ranges: 0–100, 0–200, and 0–253. These ranges represent built-in normalization and vary among different HDD manufacturers.

For predicting hard disc failures, the following SMART attributes were used:

$$ \begin{equation*} 1,\ 3,\ 4,\ 5,\ 7,\ 9,\ 10,\ 12,\ 187,\ 188,\ 190,\ 192,\ 193,\ 194,\ 197,\ 198,\ 199,\ 240,\ 241,\ 242 \end{equation*} \notag $$
for a large feature space, and
$$ \begin{equation*} 5,\ 9,\ 187,\ 188,\ 194,\ 197,\ 198 \end{equation*} \notag $$
for a smaller feature space. Descriptions of the selected SMART attributes:

To obtain a balanced dataset, all discs that failed within the year were included (a total of 1 506 records). Since different manufacturers use different normalization ranges, it was necessary to separate discs with different normalization ranges to ensure accurate model training. In this study, discs with all attributes normalized to 0–100 were used (except for attribute 199, which was in the 0–200 range for all discs). After this selection the dataset consisted of 538 records. An additional 538 records from functional discs were randomly selected.

Some records were missing attributes (no more than four attributes per record and no more than 10% for any individual attribute across all records). Missing values were imputed using the mean value for each attribute.

For experiments the data were additionally normalized to speed up neural network training and achieve more stable results. The mean value for each attribute was subtracted from the data, and then the data were divided by the variance for each attribute.
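A minimal sketch of this normalization step (ours; note that the text specifies division by the variance rather than the more common standard deviation):

```python
# Minimal sketch: per-attribute normalization as described above
# (subtract the mean of each attribute, then divide by its variance).
import numpy as np

def normalize(data):
    mean = data.mean(axis=0)
    var = data.var(axis=0)
    return (data - mean) / var
```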

The code for data preparation, conducting various experiments for the time-series classification task, and the prepared data are published on GitLab: https://gitlab.com/LukianovKirill/trustai_72.


Bibliography

1. A. Jishan and R. C. Green II, “Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced S.M.A.R.T. sensors data”, Eng. Appl. Artif. Intell., 127 (2024), 107339, 11 pp.
2. A. Caron, C. Hicks, and V. Mavroudis, A view on out-of-distribution identification from a statistical testing theory perspective, 2024, 8 pp., arXiv: 2405.03052
3. Peng Cui and Jinjia Wang, “Out-of-distribution (OOD) detection based on deep learning: a review”, Electronics, 11:21 (2022), 3500, 19 pp.
4. L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition, Appl. Math. (N. Y.), 31, Reprint of the 1996 original, Springer-Verlag, New York, 2013, xvi+636 pp.
5. S. M. Djurasevic, U. M. Pesovic, and B. S. Djordjevic, “Anomaly detection model for predicting hard disk drive failures”, Appl. Artif. Intell., 35:8 (2021), 549–566
6. A. Faragó and G. Lugosi, “Strong universal consistency of neural network classifiers”, IEEE Trans. Inform. Theory, 39:4 (1993), 1146–1151
7. D. Hendrycks and K. Gimpel, A baseline for detecting misclassified and out-of-distribution examples in neural networks, 2016, 12 pp., arXiv: 1610.02136
8. J. Jithish, B. Alangot, N. Mahalingam, and Kiat Seng Yeo, “Distributed anomaly detection in smart grids: a federated learning-based approach”, IEEE Access, 11 (2023), 7157–7179
9. A. Klein, Backblaze: Hard drive data and stats, https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data
10. Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, B. Cottereau, and Wei Tsang Ooi, “RoboDepth: robust out-of-distribution depth estimation under corruptions”, Adv. Neural Inf. Process. Syst., 36 (2023), 1–45
11. Bo Li, Peng Qi, Bo Liu, Shuai Di, Jingen Liu, Jiquan Pei, Jinfeng Yi, and Bowen Zhou, “Trustworthy AI: from principles to practices”, ACM Comput. Surveys, 55:9 (2023), 177, 46 pp.
12. Jeremiah Zhe Liu, S. Padhy, Jie Ren, Zi Lin, Yeming Wen, G. Jerfel, Z. Nado, J. Snoek, D. Tran, and B. Lakshminarayanan, “A simple approach to improve single-model deep uncertainty via distance-awareness”, J. Mach. Learn. Res., 24 (2023), 42, 63 pp.
13. A. B. Nassif, M. Abu Talib, Q. Nasir, and F. M. Dakalbab, “Machine learning for anomaly detection: a systematic review”, IEEE Access, 9 (2021), 78658–78700
14. M. Perello-Nieto, T. D. M. E. S. Filho, M. Kull, and P. Flach, “Background check: a general technique to build more reliable and versatile classifiers”, 2016 IEEE 16th international conference on data mining (ICDM), IEEE, 2016, 1143–1148
15. R. Pinciroli, L. Yang, J. Alter, and E. Smirni, “Lifespan and failures of SSDs and HDDs: similarities, differences, and prediction models”, IEEE Trans. Depend. Secure Comput., 20:1 (2023), 256–272
16. K. Rasheed, A. Qayyum, M. Ghaly, A. Al-Fuqaha, A. Razi, and J. Qadir, “Explainable, trustworthy, and ethical machine learning for healthcare: a survey”, Comput. Biol. Med., 149 (2022), 106043, 23 pp.
17. Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, R. Dutta, R. Schaeffer, Sang T. Truong, Simran Arora, M. Mazeika, D. Hendrycks, Zinan Lin, Yu Cheng, S. Koyejo, Dawn Song, and Bo Li, DecodingTrust: a comprehensive assessment of trustworthiness in GPT models, 2024 (v1 – 2023), 110 pp., arXiv: 2306.11698
18. Qibo Yang, Xiaodong Jia, Xiang Li, Jianshe Feng, Wenzhe Li, and Jay Lee, “Evaluating feature selection and anomaly detection methods of hard drive failure prediction”, IEEE Trans. Reliab., 70:2 (2021), 749–760
19. Hang Yu, Weixu Liu, Jie Lu, Yimin Wen, Xiangfeng Luo, and Guangquan Zhang, “Detecting group concept drift from multiple data streams”, Pattern Recognition, 134 (2023), 109113, 11 pp.
20. He Zhang, Bang Wu, Xingliang Yuan, Shirui Pan, Hanghang Tong, and Jian Pei, “Trustworthy graph neural networks: aspects, methods, and trends”, Proc. IEEE, 112:2 (2024), 97–139
21. Jing Zhang, Yuchao Dai, Mochu Xiang, Deng-Ping Fan, P. Moghadam, Mingyi He, C. Walder, Kaihao Zhang, M. Harandi, and N. Barnes, Dense uncertainty estimation, 2021, 15 pp., arXiv: 2110.06427
22. Mingyu Zhang, Wenqiang Ge, Ruichun Tang, and Peishun Liu, “Hard disk failure prediction based on blending ensemble learning”, Appl. Sci., 13:5 (2023), 3288, 22 pp.
23. Zhilin Zhao, Statistical methods for out-of-distribution detection, PhD thesis, Univ. Technology Sydney, 2023, 107 pp.

Citation: K. S. Lukyanov, P. A. Yaskov, A. I. Perminov, A. P. Kovalenko, D. Y. Turdakov, “Extrapolation of the Bayesian classifier with an unknown support of the two-class mixture distribution”, Russian Math. Surveys, 79:6 (2024), 991–1015