Abstract:
Modern realities and trends in learning require models with ever greater generalization ability, which leads to growth in both model size and training sample size. Such problems are already difficult to solve on a single device. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the best-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and client sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets.
Bibliography: 45 titles.
Keywords:
ASEG, distributed learning, federated learning, communication costs, Hessian similarity, homogeneous gradients, technique of additional noise.
The work was done in the Laboratory of Federated Learning Problems of the Ivannikov Institute for System Programming of the Russian Academy of Sciences and supported by the Ministry of Science and Higher Education of the Russian Federation under appendix no. 2 to agreement no. 075-03-2024-214.
1. Introduction

Empirical risk minimization is a key concept in supervised machine learning [34]. This paradigm is used to train a wide range of models, such as linear and logistic regression [40], support vector machines [30], tree-based models [18], neural networks [33], and many more. In essence, it is an optimization of the model parameters based on the average loss over the data. By minimizing the empirical risk, the model aims to generalize well to unseen samples and make accurate predictions. In order to enhance the overall predictive performance and effectively tackle advanced tasks, it is imperative to train on large datasets. By leveraging such datasets during the training process, we aim to maximize the generalization capabilities of models and equip them with the robustness and adaptability required to excel in challenging real-world scenarios [14]. The richness and variety of samples in large datasets provide the depth needed to learn complex patterns, relationships, and nuances, enabling models to make accurate and reliable predictions across a wide range of inputs and conditions. As datasets grow in size and complexity, traditional optimization algorithms may struggle to converge in a reasonable amount of time. This increases the interest in developing distributed methods [42]. Formally, the objective function in the minimization problem is a finite sum over nodes, each containing its own possibly private dataset. In the most general form we have
$$
\min_{x\in\mathbb{R}^d}\biggl\{r(x):=\frac{1}{M}\sum_{m=1}^{M}r_m(x)\biggr\},\qquad r_m(x):=\frac{1}{n_m}\sum_{j=1}^{n_m}\ell(x,y_j^m),
\tag{1}
$$
where $M$ refers to the number of nodes/clients/machines/devices, $n_m$ is the size of the $m$th local dataset, $x$ is the vector representation of a model using $d$ features, $y_j^m$ is the $j$th data point on the $m$th node, and $\ell$ is the loss function. We consider the most computationally powerful device, which can communicate with all others, to be the server. Without loss of generality we assume that $r_1$ corresponds to the server and the remaining $r_m$ to the other nodes.
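To make formulation (1) concrete, here is a minimal sketch of the finite-sum objective in Python; the quadratic loss and the synthetic local datasets are illustrative placeholders rather than the losses used later in the paper.

```python
import numpy as np

def local_risk(x, data, loss):
    """Empirical risk r_m(x): the average loss over one node's dataset."""
    return np.mean([loss(x, y) for y in data])

def global_risk(x, datasets, loss):
    """r(x) = (1/M) * sum_m r_m(x): the average of the local risks."""
    return np.mean([local_risk(x, data, loss) for data in datasets])

# Example: M = 4 nodes, each holding its own local sample.
rng = np.random.default_rng(0)
datasets = [rng.normal(size=(100, 3)) for _ in range(4)]
loss = lambda x, y: 0.5 * np.linalg.norm(x - y) ** 2  # placeholder quadratic loss
print(global_risk(np.zeros(3), datasets, loss))
```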
This formulation of the problem entails a number of challenges. In particular, transferring information between devices incurs communication overhead [36]. This can result in increased resource utilization, which affects the overall efficiency of the training process. There are different ways to deal with communication costs [29], [23], [3], [9]. One of the key approaches is exploiting the similarity of local data and hence the corresponding losses. One way to formally express it is to use Hessian similarity [39]:
$$
\|\nabla^2 r_m(x)-\nabla^2 r(x)\|\leqslant\delta\quad\text{for all } x\in\mathbb{R}^d \text{ and } m\in[1,M],
$$
where $\delta>0$ measures the degree of similarity between the corresponding Hessian matrices. The goal is to build methods with communication complexity depending on $\delta$, because it is known that $\delta\thicksim 1/\sqrt{n}$ for non-quadratic losses and $\delta\thicksim 1/n$ for quadratic ones [19]. Here $n$ is the size of the local training set. Hence the more data there are on the nodes, the more statistically ‘similar’ the losses are and the less communication is needed. The basic method in this case is $\texttt{Mirror Descent}$ with an appropriate Bregman divergence [35], [19]:
$$
x^{k+1}=\operatorname*{arg\,min}_{x\in\mathbb{R}^d}\Bigl\{\langle\nabla r(x^k),x\rangle+\frac{1}{\theta}\,D_{r_1}(x,x^k)\Bigr\},\qquad D_{r_1}(x,y):=r_1(x)-r_1(y)-\langle\nabla r_1(y),x-y\rangle.
$$
The computation of $\nabla r$ requires communication with the clients. Our goal is to reduce it. The key idea is that, by offloading the nodes, we move the most computationally demanding work to the server. Each node performs one computation per iteration (to collect $\nabla r$ on the server), while the server solves a local subproblem and accesses the local oracle $\nabla r_1$ much more frequently. While this approach provides improvements in communication complexity over a simple distributed version of $\texttt{Gradient Descent}$, there is room for improvement in various aspects. In particular, the problem of constructing an optimal algorithm remained open for a long time [35], [5], [45], [38]. It was recently solved by the introduction of $\texttt{Accelerated ExtraGradient}$ [24].
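A schematic sketch of one such iteration, assuming the server minimizes the Bregman model by plain gradient descent (the inner solver, its step size, and the iteration count are illustrative choices):

```python
import numpy as np

def server_step(x_k, g_k, grad_r1, theta, inner_steps=100, lr=0.01):
    """Approximately solve min_x <g_k, x> + D_{r_1}(x, x_k) / theta."""
    x = x_k.copy()
    grad_r1_k = grad_r1(x_k)  # the anchor gradient is evaluated once
    for _ in range(inner_steps):
        # Gradient of the model: g_k + (grad r_1(x) - grad r_1(x_k)) / theta.
        x -= lr * (g_k + (grad_r1(x) - grad_r1_k) / theta)
    return x
```

Only the averaged gradient `g_k` has to be communicated; every call to `grad_r1` is local to the server, which is exactly where the communication savings come from.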
Despite the fact that there is an algorithm with optimal communication complexity, open challenges remain: since each dataset is stored on its own node, it is expensive to compute the full gradient. Moreover, by transmitting it, we lose the ability to keep the procedure private [43], since in this case it is possible to recover local data. One of the best-known inexpensive ways to protect privacy is to add noise to the information communicated [1]. To summarize the above: in this paper we are interested in stochastic modifications of $\texttt{Accelerated ExtraGradient}$ [24] that are particularly resistant to the imposition of noise on the transmitted packets.
2. Related works
Since our goal is to investigate stochastic versions of algorithms that exploit data similarity (in particular, to speed up the computation of local gradients and to protect privacy), we give a brief literature review of related areas.
2.1. Distributed methods under similarity
In 2014 the scalable $\texttt{DANE}$ algorithm was introduced [35]. It was a novel Newton-type method, which used $\mathcal{O}\bigl((\delta/\mu)^2\log(1/\varepsilon)\bigr)$ communication rounds for $\mu$-strongly convex quadratic losses by taking a step appropriate to the geometry of the problem. This work can be considered as the first that started to apply Hessian similarity in the distributed setting. A year later $\varOmega\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ was found to be a lower bound for communication rounds under Hessian similarity [5]. Since then, extensive work has been done to achieve the outlined optimal complexity. The idea of $\texttt{DANE}$ was developed by adding momentum acceleration to the step, thereby reducing the complexity for quadratic loss to $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ [45]. Later, $\texttt{Accelerated SONATA}$ achieved $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log^2(1/\varepsilon)\bigr)$ for non-quadratic losses by the $\texttt{Catalyst}$ acceleration [39]. Finally, there is $\texttt{Accelerated ExtraGradient}$, which achieves $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ communication complexity by using ideas of Nesterov acceleration, sliding, and $\texttt{ExtraGradient}$ [24].
Other ways to introduce the definition of similarity are also found in the literature. In [22] the authors study the local technique in the distributed and federated optimization, and
$$
\sigma_*^2=\frac{1}{M}\sum_{m=1}^{M}\|\nabla r_m(x_*)\|^2
$$
is taken as a measure of gradient similarity. Here $x_*$ is the solution to problem (1). This kind of data heterogeneity is also examined in [44]. The case of gradient similarity of the form $\|\nabla r_m (x)-\nabla r(x)\| \leqslant \delta$ is discussed in [17]. Moreover, several papers use gradient similarity in the analysis of distributed saddle point problems [6], [10], [8], [7].
2.2. Privacy preserving
A brief overview of the issue of privacy in modern optimization is presented in [15]. In this paper, we are interested in differential privacy. In essence, differential privacy guarantees that results of a data analysis process remain largely unaffected by the addition or removal of any single individual’s data. This is achieved by introducing randomness or noise into the data in a controlled manner, so that the statistical properties of the data remain intact while protecting the privacy of individual samples [2]. Differential privacy involves randomized perturbations of the information sent between nodes and the server [32]. In this case it is important to design an algorithm that is resistant to transformations.
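As a sketch of how such controlled noise is typically injected, the following snippet implements the standard Gaussian mechanism: the vector is first clipped to bound its sensitivity and then perturbed with calibrated noise. Treating the whole transmitted vector as a single query, and the particular $(\varepsilon,\delta)$ calibration (valid for $\varepsilon<1$), are assumptions of this illustration.

```python
import numpy as np

def gaussian_mechanism(v, clip_norm, eps, delta, rng):
    """Clip v to l2-norm clip_norm, then add noise calibrated for (eps, delta)-DP."""
    v = v * min(1.0, clip_norm / np.linalg.norm(v))       # bound the l2 sensitivity
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return v + rng.normal(scale=sigma, size=v.shape)      # calibrated Gaussian noise
```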
2.3. Stochastic optimization
The simplest stochastic methods, for example, $\texttt{SGD}$, face a common problem: the stochastic gradient estimate does not tend to zero as the method approaches the optimum [16]. The solution to this issue appeared in 2014 with the invention of the variance reduction technique. There are two main approaches to its implementation: $\texttt{SAGA}$ [13] and $\texttt{SVRG}$ [20]. The first maintains a history of gradients, while the second introduces a reference point at which the full gradient is computed and updates this point infrequently. The methods listed above do not use acceleration. In recent years a number of accelerated variance reduction methods have emerged [31], [26], [25], [4].
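As a sketch, the $\texttt{SVRG}$-style estimator can be written as follows; the interface `grads` (a list of per-sample gradient callables) is an assumption of this illustration.

```python
import numpy as np

def svrg_estimator(x, snapshot, full_grad_at_snapshot, grads, rng):
    """Unbiased estimate: grad_i(x) - grad_i(snapshot) + full_grad(snapshot)."""
    i = rng.integers(len(grads))  # sample one component uniformly
    return grads[i](x) - grads[i](snapshot) + full_grad_at_snapshot
```

The correction term vanishes as both `x` and `snapshot` approach the optimum, which is precisely why the estimator's variance, unlike that of plain $\texttt{SGD}$, tends to zero.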
The idea of stochastic methods with variance reduction also carries over to the distributed setup. In particular, stochasticity helps to break through the lower bounds under the Hessian similarity condition. For example, there are methods that use variance reduction to implement client sampling. $\texttt{Catalyzed SVRP}$ enjoys $\mathcal{O}\bigl((1+M^{-1/4} \cdot\sqrt{\delta/\mu}\,)\log^2(1/\varepsilon)\bigr)$ complexity in terms of the amount of server-node communication [21]. This method combines stochastic proximal point evaluation, client sampling, variance reduction, and acceleration via the $\texttt{Catalyst}$ framework [28]. However, it requires strong convexity of each $r_m$. It is possible to move to weaker assumptions on the local functions: $\texttt{AccSVRS}$ achieves $\mathcal{O}\bigl((1+M^{-1/4}\cdot\sqrt{\delta/\mu}\,) \log(1/\varepsilon)\bigr)$ communication complexity [27]. In addition, variance reduction turns out to be useful for designing methods with compression: $\texttt{Three Pillars Algorithm}$ achieves $\mathcal{O}\bigl((1+M^{-1/2}\cdot(\delta/\mu))\log(1/\varepsilon)\bigr)$ for variational inequalities [11].
The weakness of the accelerated variance reduction methods is that they require a small step size inversely proportional to $\sqrt{M}$ [11], [21], [27]. Thus, one has to ‘pay’ for convergence to an arbitrary $\varepsilon$-solution by increasing the iteration complexity of the method. However, there is the gain, mentioned already, in terms of the number of server-node communications. Meanwhile, accelerated versions of the classic $\texttt{SGD}$ do not have this flaw [41]. In some regimes the accuracy that can be achieved without the use of variance reduction can be sufficient: these include distributed problems under similarity conditions.
3. Our contribution
$\bullet$ New method. We propose a new method based on the deterministic $\texttt{Accelerated ExtraGradient (AEG)}$ [24]. Unlike $\texttt{AEG}$, we add the ability to sample computational nodes, which saves a significant amount of runtime. As mentioned above, sampling was already implemented in $\texttt{Three Pillars Algorithm}$ [11], $\texttt{Catalyzed SVRP}$ [21], and $\texttt{AccSVRS}$ [27]. However, this was done using $\texttt{SVRG}$-like approaches. Instead, we build on $\texttt{SGD}$, which does not require decreasing the step size as the number of computational nodes increases.
$\bullet$ Privacy via noise. We consider stochasticity not only in the choice of nodes, but also in the additive noise in the transmission of information from the client to the server. This makes it much harder to steal private data in the case of an attack on the algorithm.
$\bullet$ Theoretical analysis. We show that Hessian similarity implies gradient similarity with an additional term. Therefore, it is possible to estimate the variance associated with sampling nodes more optimistically. Thus, in some cases $\texttt{ASEG}$ converges quite accurately before getting ‘stuck’ near the solution.
$\bullet$ Analysis of a subproblem. We consider different approaches to the solution of the subproblem that occurs on a server. Our analysis allows us to estimate the number of iterations required to ensure convergence of the method.
$\bullet$ Experiments. We validate the theory constructed by numerical experiments on several real datasets.
4. Stochastic version of $\texttt{Accelerated ExtraGradient}$
Algorithm 1 ($\texttt{ASEG}$) is a modification of $\texttt{AEG}$ [24]. In one iteration the server communicates with clients twice. In the first round random machines are selected (Line 6). They perform a computation and send the gradient to the server (Line 15). Then the server solves a local subproblem in Line 16 and the solution is used to compute the stochastic (extra)gradient at the second round of communication (Line 17 and Line 26). This is followed by a gradient step with momentum in Line 27.
Let us summarize briefly the main differences from $\texttt{AEG}$. Line 16 of Algorithm 1 uses a gradient over local batches of nodes, rather than over all of them, as done in the baseline method. Moreover, the case where local gradients are noisy is supported. This is needed for privacy purposes. Similarly in Line 27. Note that we consider the most general case where different devices are involved in solving the subproblem and performing the gradient step.
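The communication pattern just described can be sketched as follows. This mirrors only the structure of Algorithm 1 (client sampling, two rounds of exchange around the server subproblem, a momentum step); the exact coupling of the sequences and the coefficients $\tau$ and $\eta$ are fixed by the tuning in Lemma 1, and all names here are placeholders.

```python
import numpy as np

def aseg_iteration(x_f, x_g, clients, server, B, theta, tau, eta, rng):
    # Line 6: sample B clients uniformly without replacement.
    batch = rng.choice(len(clients), size=B, replace=False)

    # Round 1 (Line 15): sampled clients send (possibly noisy) local gradients.
    g1 = np.mean([clients[m].noisy_grad(x_g) for m in batch], axis=0)

    # Line 16: the server solves the local subproblem with its own oracle.
    x_half = server.solve_subproblem(x_g, g1, theta)

    # Round 2 (Lines 17 and 26): stochastic extragradient at the subproblem solution.
    g2 = np.mean([clients[m].noisy_grad(x_half) for m in batch], axis=0)

    # Line 27: gradient step with momentum (illustrative coupling).
    x_f = x_half
    x_g = x_f + tau * (x_g - x_f) - eta * g2
    return x_f, x_g
```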
$\texttt{ASEG}$ requires $\varTheta(2B)$ communication rounds (Line 15 and Line 26), where $B$ is the number of nodes involved in the iteration. Unlike our method, $\texttt{AEG}$ does $\varTheta(2M-2)$ rounds per iteration, and $\texttt{AccSVRS}$ does $\varTheta(2M)$ rounds on average. Thus, in experiments either the difference between the methods is invisible, or $\texttt{AEG}$ loses. $\texttt{ASEG}$ deals with the local gradient $\nabla r_m(x,\xi)$, which can have the structure
$$
\nabla r_m(x,\xi)=\nabla r_m(x)+\xi,
$$
where, for example, $\xi\sim N(0,\sigma^2)$ or $\xi\sim U(-c,c)$. This is the additive noise overlay technique, mentioned previously and popular in federated learning.
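A minimal sketch of this overlay, assuming the noise scales $\sigma$ and $c$ are dictated by the privacy requirements:

```python
import numpy as np

def noisy_grad(grad, sigma=0.1, c=None, rng=None):
    """Return grad + xi with xi ~ N(0, sigma^2 I), or xi ~ U(-c, c)^d if c is given."""
    if rng is None:
        rng = np.random.default_rng()
    if c is None:
        xi = rng.normal(scale=sigma, size=grad.shape)  # Gaussian overlay
    else:
        xi = rng.uniform(-c, c, size=grad.shape)       # uniform overlay
    return grad + xi
```

Both choices yield an unbiased oracle with bounded variance, which is consistent with Assumption 3 below.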
5. Theoretical analysis
Before we delve into the analysis, let us introduce some definitions and assumptions needed to build the convergence theory.
In this paper complexity is understood in the sense of the number of server-node communications. This means that we consider how many times the server exchanges vectors with clients. In this case, a single contact is counted as one communication.
We work with functions of a class widely used in the literature.
Definition 1. We say that a differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex on $\mathbb{R}^d$ if
$$
f(y)\geqslant f(x)+\langle\nabla f(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^2\quad\text{for all }x,y\in\mathbb{R}^d.
$$
We assume a strongly convex optimization problem (1) with possibly non-convex components.
Assumption 1. The function $r$ is $\mu$-strongly convex, and $r_1$ is convex and $L_1$-smooth.
This assumption does not require the convexity of the local functions: $\delta$-relatedness is a strong enough property that they are allowed to be non-convex. In this paper we are interested in the case where $\mu<\delta$, since $\mu\ll\delta\ll L$ for large datasets in practice [38].
Proposition 1. Consider the $\delta$-relatedness of $r_m(\,\cdot\,)$ to $r(\,\cdot\,)$ (Definition 3) for each $m\in[1,M]$. In this case we have
$$
\|\nabla r_m(x)-\nabla r(x)\|\leqslant\delta\|x-x_*\|+\|\nabla r_m(x_*)\|\quad\text{for all }x\in\mathbb{R}^d.
$$
Thus, from the similarity of the Hessians follows the similarity of the gradients with some correction: it suffices to integrate the difference of the Hessians along the segment from $x_*$ to $x$ and to use $\nabla r(x_*)=0$. Let us introduce a more general assumption on gradient similarity.
Assumption 2. For all $m\in[1,M]$ Definition 3 is satisfied and the following inequality holds:
$$
\|\nabla r_m(x)-\nabla r(x)\|\leqslant\zeta\,\delta\|x-x_*\|+\sigma_{\rm sim}\quad\text{for all }x\in\mathbb{R}^d.
$$
Assumption 2 is a generalization of Proposition 1 in which the correction term is multiplied by some constant $\zeta$. Formally, $\zeta$ can take any non-negative value, but it suffices to analyze only two cases corresponding to different degrees of similarity, $\zeta=0$ and $\zeta=1$: all positive values of $\zeta$ give asymptotically the same complexity.
A stochastic oracle is available for the function at each node. We impose the following standard assumption.
Assumption 3. For all $m\in\{1,\dots,M\}$ the stochastic oracle $\nabla r_m(x,\xi)$ is unbiased and has a bounded variance:
$$
\mathsf{E}_\xi\bigl[\nabla r_m(x,\xi)\bigr]=\nabla r_m(x),\qquad \mathsf{E}_\xi\bigl\|\nabla r_m(x,\xi)-\nabla r_m(x)\bigr\|^2\leqslant\sigma_{\rm noise}^2.
$$
Remark 1. The Hessian similarity of $r_m$ and $r$ for $m\ne1$ is used only to formulate Proposition 1 and is not needed anywhere else. The subsequent proofs require only that the server function reflects the averaged nature of the data, while the data at the nodes may be heterogeneous.
We want to obtain a criterion that guarantees the convergence of Algorithm 1. The obvious idea is to try to find one for the new algorithm by analogy with how it was done in the non-stochastic case [24].
Lemma 1. Consider Algorithm 1 for problem (1) under Assumptions 1, 2, and 3 with the following tuning:
See the proof in Appendix 1. First we deal with the case under the gradient similarity condition ($\zeta=0$), since it removes two additional constraints on the parameters and makes the analysis simpler.
The following theorem is obtained by rewriting the result of Lemma 1 for a finer tuning of the parameters. We take $\mu<\delta$ into account and find that
The theorem asserts convergence only to some neighbourhood of the solution. To guarantee convergence with an arbitrary accuracy, a suitable parameter $\theta$ must be chosen. We propose a corollary of Theorem 1.
Corollary 1. Consider the conditions of Theorem 1. For an appropriate tuning of $\theta$ Algorithm 1 requires
See the proof in Appendix 1. Note that the linear part of the estimate obtained reproduces the convergence of $\texttt{AEG}$. However, for $B=1$, for example, an iteration of $\texttt{ASEG}$ costs $\varTheta(2)$ communications instead of $\varTheta(2M-2)$. Thus, our method requires significantly less communication to reach the accuracy determined by the sublinear term. Moreover, as mentioned before, in the estimates for $\texttt{Catalyzed SVRP}$ and $\texttt{AccSVRS}$ an additional multiplier $M^{-1/4}$ arises due to the use of variance reduction. Comparing $M^{-1/4}\cdot\sqrt{\delta/\mu}$ with $BM^{-1}\cdot\sqrt{\delta/\mu}$, we notice the superiority of $\texttt{ASEG}$ in terms of the approach taken to measure communication complexity.
Note that if the gradients sent to the server are noiseless ($\sigma_{\rm noise}=0$) and the functions are absolutely similar in the gradient sense ($\sigma_{\rm sim}=0$), then $\texttt{ASEG}$ has an $M^{3/4}/B$ times lower complexity in terms of the number of communications.
In the case of a convex objective it is not possible to implement the method under Assumption 2 for $\zeta\ne0$, since the correction terms arising cannot be eliminated. Under Assumption 2 for $\zeta=0$ we propose a modification of $\texttt{ASEG}$ for convex objective. In this case we need
See the proof in Appendix 1. It can be seen from this proof that Assumption 2 for $\zeta=1$ leads to worse estimates. Nevertheless, the optimal communication complexity can be achieved if we choose a sufficiently large $B$ which equalizes two linear terms (at least $72\delta^{3/2}/\mu^{3/2}$). It comes from the equality
Thus, even in problems with a poor ratio of $\delta$ to $\mu$, if the number of available nodes is large enough, $\texttt{ASEG}$ outperforms $\texttt{AEG}$, $\texttt{Catalyzed SVRP}$, and $\texttt{AccSVRS}$.
6. Solving the subproblem

Thus, the subproblem is $L_A$-smooth with $L_A=1/\theta+L_1$ and $(1/\theta)$-strongly convex. In particular, its condition number equals $\theta L_A=1+\theta L_1$, which remains moderate under the choice $\theta\leqslant 1/(3\delta)$.
Despite the fact that problem (7) does not affect the number of communications, we would still like to spend less server time solving it. An obvious solution is to use stochastic approaches.
6.1. The $\texttt{SGD}$ approach
The basic approach is to use $\texttt{Stochastic Gradient Descent}$ ($\texttt{SGD}$) for the strongly convex smooth objective. In this case the convergence rate for the subproblem, measured in the squared argument norm, is given by the following theorem [37].
Theorem 3. Let Assumptions 1 and 3 hold for problem (7). Consider $T$ iterations of $\texttt{SGD}$ with step size $\gamma\leqslant 1/(2L_A)$. Then the following estimate holds:
Our interest is to choose step sizes in such a way as to avoid getting stuck in a neighbourhood of the solution of the subproblem. In accordance with [37], we note the following.
Theorem 4. Under Assumptions 1 and 3, consider the problem (7) to be solved using $\texttt{SGD}$. There exist decreasing step sizes $\gamma_t\leqslant 1/(2L_A)$ such that for $T$ iterations we obtain
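A sketch of such a schedule with an illustrative capped $O(1/t)$ decay (the exact step sizes in [37] differ in the constants); `stoch_grad` is an assumed callable returning an unbiased gradient estimate of the subproblem objective:

```python
import numpy as np

def sgd_subproblem(x0, stoch_grad, L_A, mu_A, T, rng):
    """SGD with gamma_t <= 1/(2 L_A), decaying as O(1/t) once the cap is passed."""
    x = x0.copy()
    for t in range(T):
        gamma_t = min(1.0 / (2.0 * L_A), 2.0 / (mu_A * (t + 1)))
        x -= gamma_t * stoch_grad(x, rng)
    return x
```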
In this way we can achieve the optimal convergence rate of $\texttt{SGD}$. However, we would like to enjoy linear convergence. The idea is to use variance reduction methods.
6.2. The $\texttt{SVRG}$ approach
Another approach to the solution of the subproblem is to use variance reduction methods, for example, $\texttt{Stochastic Variance Reduced Gradient}$ ($\texttt{SVRG}$). Let us take $x_g^k$ as a starting point and look at the first epoch. It is known from [16] that
where $J$ is the epoch size. The next point after the algorithm finishes the current epoch is $x=\dfrac{1}{J}\displaystyle\sum_{j=1}^Jx_j$. We can rewrite for the $T$th epoch:
Thus, for a suitable combination of the epoch size and step, we expect to obtain linear convergence in the subproblem, which is consistent with the experimental results.
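A sketch of this epoch structure, with the averaged iterate returned as the next reference point, as in the display above; `full_grad` and the list `grads` of per-sample gradient callables are assumed interfaces:

```python
import numpy as np

def svrg_epochs(x_ref, full_grad, grads, gamma, J, T, rng):
    for _ in range(T):                      # T epochs
        mu_ref = full_grad(x_ref)           # full gradient at the reference point
        x, iterates = x_ref.copy(), []
        for _ in range(J):                  # J inner steps per epoch
            i = rng.integers(len(grads))
            g = grads[i](x) - grads[i](x_ref) + mu_ref  # variance-reduced estimate
            x -= gamma * g
            iterates.append(x.copy())
        x_ref = np.mean(iterates, axis=0)   # x = (1/J) * sum_j x_j
    return x_ref
```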
6.3. The loopless approach
In this subsection we fill in some gaps in the analysis of the subproblem. The decision criterion for $\texttt{ASEG}$ is (4). Assume that we have obtained
We use this to obtain a lower bound on $T$. The important question is whether it is possible to take fewer steps while still meeting the criterion necessary for the convergence of the whole algorithm. In other words, we want to check whether the loopless version of $\texttt{ASEG}$ turns out to be better than the usual one, as happens in the case of $\texttt{SVRG}$ and $\texttt{Katyusha}$ [25].
We are interested in comparing the deterministic and random choices of $T$. Let $\xi$ be a random variable describing the number of steps; $\xi$ takes positive discrete values. Let $A$ be the number of steps for the subproblem of $\texttt{ASEG}$ defined deterministically and satisfying (4). Assume that
where $x_{\xi}$ is the point obtained in $\xi$ solver steps. Obviously, the converse is also true. In other words, condition (9) is a criterion for $\xi$ to solve the initial subproblem. At the same time, we want to minimize $\mathsf{E}\xi$ to take as few steps as possible.
Proposition 2. If $\mathsf{E}[\xi] < A$, then $\mathsf{E}\biggl[\dfrac{1}{\xi^{\alpha}}\biggr]>\dfrac{1}{A^{\alpha}}$ .
Proof. Let $p_i$ be the probability of choosing $\xi=i$. Then we have
$$
\mathsf{E}\biggl[\frac{1}{\xi^{\alpha}}\biggr]=\sum_{i}\frac{p_i}{i^{\alpha}}\geqslant\frac{1}{\bigl(\sum_{i}p_i\,i\bigr)^{\alpha}}=\frac{1}{(\mathsf{E}\xi)^{\alpha}}>\frac{1}{A^{\alpha}},
$$
where the first inequality is Jensen's inequality applied to the convex function $t\mapsto t^{-\alpha}$ (for $\alpha>0$), and the last one uses $\mathsf{E}\xi<A$. $\Box$
Thus, in the worst case, if the expectation of $\xi$ is less than the deterministic number of iterations, the criterion cannot be met and there is no probabilistic method of choosing the number of iterations that is better than the deterministic one.
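As a quick illustrative check of Proposition 2, one can draw $\xi$ from a geometric distribution with $\mathsf{E}\xi<A$ and compare the two sides numerically; the distribution and the parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A, alpha = 20, 1.0
xi = rng.geometric(p=0.1, size=10**6)             # E[xi] = 10 < A = 20
print(xi.mean())                                  # ~10
print(np.mean(1.0 / xi**alpha), 1.0 / A**alpha)   # the left value exceeds the right
```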
It is worth noting that the same can be said about the pair $\texttt{SVRG}$ and $\texttt{L-SVRG}$. Indeed, $\texttt{SVRG}$ reduces to $\texttt{L-SVRG}$ if the number of iterations performed before updating the reference gradient is treated as a random variable. Then the epoch size $J$ follows a geometric distribution, and one can compare which algorithm gives a more accurate solution after performing the inner loop. For the $\texttt{SVRG}$-rate we have [16]
7. Experiments

We conduct experiments on the $l_2$-regularized logistic regression problem with
$$
r_m(x)=\frac{1}{n_m}\sum_{j=1}^{n_m}\log\bigl(1+\exp(-b_j^m\langle a_j^m,x\rangle)\bigr)+\frac{\lambda}{2}\|x\|^2.
$$
Here $x$ refers to the parameters of the model, $a_j^m$ is the feature description, and $b_j^m$ is the corresponding label. To avoid overfitting, we use $l_2$-regularization. The penalty parameter $\lambda$ is set to $L/100$, where $L$ is the smoothness constant (see Definition 2). We use $\texttt{SVRG}$ and $\texttt{SARAH}$ as the subproblem solvers in Algorithm 1. In all tests the original datasets are merged into batches $\{r_b\}_{b=1}^{M+N}$. A function $r_1(x)$ is created from part of the acquired ones:
$$
r_1(x)=\frac{1}{N}\sum_{b=1}^{N} r_b(x),\qquad N < M.
$$
Due to Hessian similarity we have the following expression for the smoothness constant:
For logistic regression $\delta$ is estimated by enumerating 100 points in a neighbourhood of the solution. In both cases the value obtained is multiplied by $3/2$. In order to demonstrate correctly how the theory works in practice, we solve problem (1) for various parameters $\delta$, $\mu$, $M$, and $N$. We consider different tabular datasets from LibSVM [12]: $\texttt{Mushrooms}$ (binary classification, 112 features, 8 124 objects), $\texttt{W8A}$ (binary classification, 300 features, 49 749 objects), and $\texttt{A9A}$ (binary classification, 123 features, 32 561 objects). They allow for a wide range of different problem parameters and are most commonly used to illustrate the convergence of methods in the literature.
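A sketch of this setup for one of the LibSVM datasets; the file name, the label normalization, and the sizes $M$ and $N$ are illustrative, and the loader from scikit-learn is an assumed dependency.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

A, b = load_svmlight_file("mushrooms")     # features and labels in LibSVM format
A = A.toarray()
b = np.where(b == b.max(), 1.0, -1.0)      # normalize labels to {-1, +1}
n, d = A.shape

# Smoothness constant of the (unregularized) logistic loss: lambda_max(A^T A) / (4n).
L = np.linalg.eigvalsh(A.T @ A / (4 * n)).max()
lam = L / 100                              # l2 penalty, as in the text
L_total = L + lam                          # smoothness of the regularized objective

M, N = 50, 10                              # illustrative numbers of batches
batches = np.array_split(np.arange(n), M + N)
r1_indices = np.concatenate(batches[:N])   # data points forming r_1
```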
$\bullet$ We investigate how the size of the batches affects the convergence of $\texttt{ASEG}$ on problems with different parameters. Fig. 1 shows the dependence of the convergence of untuned $\texttt{ASEG}$ on the batch size. In the first experiment the method on 50 nodes solves the problem for $\mu=0.105$ and $\delta=1.45$. According to Corollary 2, one should choose $B>72(\delta/\mu)^{3/2} \approx 3\,695$ to obtain the optimal complexity; since only 50 nodes are available, the guaranteed rate is suboptimal for every feasible batch size. In the second experiment the method on 200 nodes solves the problem for $\mu=0.06$ and $\delta=0.07$. In this case $\texttt{ASEG}$ has the guaranteed optimal complexity for $B>57$ and converges to rather small values of the criterion.
$\bullet$ We compare $\texttt{ASEG}$ with $\texttt{AccSVRS}$ (Figs. 2 and 3); 50 nodes are available for $\texttt{Mushrooms}$ and 200 nodes for $\texttt{A9A}$. If the relation between $\mu$, $\delta$, and $B$ yields the optimal complexity according to Corollary 2, then, owing to similarity, the batching noise admits an optimistic estimate, which provides a significant gain over $\texttt{AccSVRS}$. Otherwise, it is worth using full batching and obtaining a less appreciable gain.
$\bullet$ In the $\texttt{ASEG}$ stability experiments (Figs. 4–7) we test how white noise imposed on each $\nabla r_m(x)$ affects convergence. It can be seen that increasing the randomness does not worsen convergence. This can be explained by the fact that convergence is mainly determined by the accuracy of the solver chosen.
$\bullet$ We also test how changing the epoch size affects the convergence of $\texttt{ASEG}$ itself and the convergence in the subproblem (Line 16 of Algorithm 1) for the $\texttt{SVRG}$ and $\texttt{SARAH}$ solvers (Figs. 8–15).
8. Conclusion

Based on the $\texttt{SGD}$-like approach, we have adapted $\texttt{Accelerated ExtraGradient}$ to federated learning problems by adding two sources of stochasticity: sampling of computational nodes and the application of noise to local gradients. Numerical experiments show that for small values of $\delta/\mu$ or a sufficiently large number of available nodes Algorithm 1 ($\texttt{ASEG}$) is significantly superior to accelerated variance reduction methods (Fig. 3), which is consistent with the theory (Corollary 1). If the problem parameters are not ‘good’ enough, a $[\delta^2/(\mu^2M)]\cdot\log(1/\varepsilon)$ term appears in the communication complexity of batched $\texttt{ASEG}$ (Corollary 2). Nevertheless, even in this case the superiority in the number of communications remains (Figs. 2 and 3). Experiments also show the robustness of our method against noise (Figs. 4–7). The server subproblem has also been studied numerically (Figs. 10, 11, 14, and 15) and theoretically. The analysis shows the inefficiency of a random choice of the number of solver iterations for the internal subproblem (Proposition 2).
Appendix 1. Proofs

Throughout the proofs, expectations are understood according to the context.
In this section we prove Lemma 1. First, we need the following assertion.
Lemma 2. Consider Algorithm 1. Let $\theta$ be defined as in Lemma 1: $\theta \leqslant 1/(3\delta)$. Then under Assumptions 1 and 3 the following inequality holds for all $\overline{x} \in \mathbb{R}^d$:
Let us use the tower property and take the $\xi$-expectation first. Since $m_i$ and $m_j$ are independent, we can take the expectations of the factors in the second summand separately. Then only the first summand remains; estimating it by means of Assumption 2, we obtain
As in the case of Corollary 1, an iteration of the method requires $B/M$ communications. $\Box$
Appendix D. $\texttt{ASEG}$ for the convex case
The next algorithm is an adaptation of Algorithm 1 to the convex case. We use time-varying $\tau_k$ and $\eta_k$.
Theorem 6. Let Algorithm 2 run for $N$ iterations under Assumption 1 for $\mu= 0$, Assumption 2 for $\zeta=0$, and Assumption 3, with the following tuning:
Similarly to the case where $\mu>0$, it is not possible to ensure convergence by a naive choice of $\theta$. We propose a choice of $\theta$ that ensures optimal convergence.
Corollary 4. Consider Assumption 1 for $\mu=0$, Assumption 2 for $\zeta=0$, and Assumption 3. Let Algorithm 2 run for $N$ iterations with the tuning of $\tau_k$ and $\eta_k$ as in Theorem 6. Then the following tuning of $\theta$ is proposed.
Notice that $\theta= \dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}}$ equalizes the terms on the right-hand side. If $\dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}} \leqslant \dfrac{1}{3\delta}$ , we choose $\theta= \dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}}$ . Then we obtain
Bibliography

1. M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and Li Zhang, “Deep learning with differential privacy”, Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (Vienna 2016), ACM, New York, 2016, 308–318
2. M. Aitsam, “Differential privacy made easy”, 2022 International conference on emerging trends in electrical, control, and telecommunication engineering (ETECTE) (Lahore 2022), IEEE, 2022, 1–7
3. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS'17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718, https://proceedings.neurips.cc/paper/2017/hash/6c340f25839e6acdc73414517203f5f0-Abstract.html
4. Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods”, J. Mach. Learn. Res., 18 (2018), 221, 51 pp.
6. B. Barazandeh, Tianjian Huang, and G. Michailidis, “A decentralized adaptive momentum method for solving a class of min-max optimization problems”, Signal Process., 189 (2021), 108245, 23 pp.
7. A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133, https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html
8. A. Beznosikov and A. Gasnikov, “Compression and data similarity: combination of two techniques for communication-efficient solving of distributed variational inequalities”, Optimization and applications, Lecture Notes in Comput. Sci., 13781, Springer, Cham, 2022, 151–162
9. A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning”, J. Mach. Learn. Res., 24 (2023), 276, 50 pp.
10. A. Beznosikov, G. Scutari, A. Rogozin, and A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184, https://proceedings.neurips.cc/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html
11. A. Beznosikov, M. Takác, and A. Gasnikov, “Similarity, compression and local steps: three pillars of efficient communications for distributed variational inequalities”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1246, 28663–28677, https://proceedings.neurips.cc/paper_files/paper/2023/hash/5b4a459db23e6db9be2a128380953d96-Abstract-Conference.html
12. Chih-Chung Chang and Chih-Jen Lin, “LIBSVM: A library for support vector machines”, ACM Trans. Intell. Syst. Technol. (TIST), 2:3 (2011), 27, 27 pp.
13. A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”, NIPS'14: Proceedings of the 27th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 27, MIT Press, Cambridge, MA, 2014, 1646–1654, https://proceedings.neurips.cc/paper_files/paper/2014/hash/ede7e2b6d13a41ddf9f4bdef84fdc737-Abstract.html
15. S. Gade and N. H. Vaidya, “Private optimization on networks”, 2018 Annual American control conference (ACC) (Milwaukee, WI 2018), IEEE, 2018, 1402–1409
16. E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 680–690, https://proceedings.mlr.press/v108/gorbunov20a.html; 2019, 38 pp., arXiv: 1905.11261
17. E. Gorbunov, F. Hanzely, and P. Richtárik, “Local SGD: unified theory and new efficient methods”, Proceedings of the 24th international conference on artificial intelligence and statistics (AISTATS 2021), Proc. Mach. Learn. Res. (PMLR), 130, 2021, 3556–3564, https://proceedings.mlr.press/v130/gorbunov21a.html
18. T. M. Hehn, J. F. Kooij, and F. A. Hamprecht, “End-to-end learning of decision trees and forests”, Int. J. Comput. Vis., 128:4 (2020), 997–1011
19. H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, and L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227, https://proceedings.mlr.press/v119/hendrikx20a.html
21. A. Khaled and Chi Jin, Faster federated optimization under second-order similarity, 2023 (v1 – 2022), 44 pp., arXiv: 2209.02257
22. A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529, https://proceedings.mlr.press/v108/bayoumi20a.html
23. A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393, https://proceedings.mlr.press/v119/koloskova20a.html
24. D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, and G. Scutari, “Optimal gradient sliding and its application to optimal distributed optimization under similarity”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2427, 33494–33507, https://proceedings.neurips.cc/paper_files/paper/2022/hash/d88f6f81e1aaf606776ffdd06fdf24ef-Abstract-Conference.html
25. D. Kovalev, S. Horváth, and P. Richtárik, “Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop”, Proceedings of the 31st international conference on algorithmic learning theory (ALT 2020), Proc. Mach. Learn. Res. (PMLR), 117, 2020, 451–467, https://proceedings.mlr.press/v117/kovalev20a.html
26. Zhize Li, Hongyan Bao, Xiangliang Zhang, and P. Richtárik, “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization”, Proceedings of the 38th international conference on machine learning (ICML 2021), Proc. Mach. Learn. Res. (PMLR), 139, 2021, 6286–6295, https://proceedings.mlr.press/v139/li21a.html
29. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics (AISTATS 2017), Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282, https://proceedings.mlr.press/v54/mcmahan17a
30. J. M. Moguerza and A. Muñoz, “Support vector machines with applications”, Statist. Sci., 21:3 (2006), 322–336
31. L. M. Nguyen, Jie Liu, K. Scheinberg, and M. Takáč, “SARAH: A novel method for machine learning problems using stochastic recursive gradient”, Proceedings of the 34th international conference on machine learning (ICML 2017), Proc. Mach. Learn. Res. (PMLR), 70, 2017, 2613–2621, https://proceedings.mlr.press/v70/nguyen17b.html
32. E. Nozari, P. Tallapragada, and J. Cortés, “Differentially private distributed convex optimization via functional perturbation”, IEEE Trans. Control Netw. Syst., 5:1 (2018), 395–408
34. S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence”, J. Mach. Learn. Res., 11 (2010), 2635–2670
35. O. Shamir, N. Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning (ICML 2014), Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008, https://proceedings.mlr.press/v32/shamir14.html
36. S. Sreenivasan, R. Cohen, E. López, Z. Toroczkai, and H. E. Stanley, Communication bottlenecks in scale-free networks, 2006, 5 pp., arXiv: cs/0604023
37. S. U. Stich, Unified optimal analysis of the (stochastic) gradient method, 2019, 11 pp., arXiv: 1907.04232
38. Ying Sun, G. Scutari, and A. Daneshmand, “Distributed optimization based on gradient tracking revisited: enhancing convergence rate via surrogation”, SIAM J. Optim., 32:2 (2022), 354–385
39. Ye Tian, G. Scutari, Tianyu Cao, and A. Gasnikov, “Acceleration in distributed optimization under similarity”, Proceedings of the 25th international conference on artificial intelligence and statistics (AISTATS 2022), Proc. Mach. Learn. Res. (PMLR), 151, 2022, 5721–5756, https://proceedings.mlr.press/v151/tian22b.html
40. G. Tripepi, K. J. Jager, F. W. Dekker, and C. Zoccali, “Linear and logistic regression analysis”, Kidney Int., 73:7 (2008), 806–810
41. S. Vaswani, I. Laradji, F. Kunstner, Si Yi Meng, M. Schmidt, and S. Lacoste-Julien, Adaptive gradient methods converge faster with over-parameterization (but you should do a line-search), 2021 (v1 – 2020), 41 pp., arXiv: 2006.06835
42. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.
43. P. C. Weeraddana and C. Fischione, “On the privacy of optimization”, IFAC-PapersOnLine, 50:1 (2017), 9502–9508
45. Xiao-Tong Yuan and Ping Li, “On convergence of distributed approximate Newton methods: globalization, sharper bounds and beyond”, J. Mach. Learn. Res., 21 (2020), 206, 51 pp.
Citation:
D. A. Bylinkin, K. D. Degtyarev, A. N. Beznosikov, “Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning”, Russian Math. Surveys, 79:6 (2024), 939–973