Abstract:
Modern realities and trends in learning require models with ever greater generalization ability, which leads to growth in both model size and training sample size. Such problems are already difficult to solve on a single device. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the best-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and client sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets.
Bibliography: 45 titles.
Keywords:
ASEG, distributed learning, federated learning, communication costs, Hessian similarity, homogeneous gradients, technique of additional noise.
The work was done in the Laboratory of Federated Learning Problems of the Ivannikov Institute for System Programming of the Russian Academy of Sciences and supported by the Ministry of Science and Higher Education of the Russian Federation under appendix no. 2 to agreement no. 075-03-2024-214.
1. Introduction

Empirical risk minimization is a key concept in supervised machine learning [34]. This paradigm is used to train a wide range of models, such as linear and logistic regression [40], support vector machines [30], tree-based models [18], neural networks [33], and many more. In essence, it is an optimization of the model parameters based on the average loss over the data. By minimizing the empirical risk, the model aims to generalize well to unseen samples and make accurate predictions. In order to enhance the overall predictive performance and effectively tackle advanced tasks, it is imperative to train on large datasets. By leveraging such datasets during the training process, we aim to maximize the generalization capabilities of models and equip them with the robustness and adaptability required to excel in challenging real-world scenarios [14]. The richness and variety of samples in large datasets provide the depth needed to learn complex patterns, relationships, and nuances, enabling models to make accurate and reliable predictions across a wide range of inputs and conditions. As datasets grow in size and complexity, traditional optimization algorithms may struggle to converge in a reasonable amount of time. This increases the interest in developing distributed methods [42]. Formally, the objective function in the minimization problem is a finite sum over nodes, each containing its own possibly private dataset. In the most general form we have
$$
\min_{x\in\mathbb{R}^d}\biggl\{r(x):=\frac{1}{M}\sum_{m=1}^{M}r_m(x)\biggr\},\qquad r_m(x):=\frac{1}{n_m}\sum_{j=1}^{n_m}\ell(x,y_j^m),
\tag{1}
$$
where $M$ refers to the number of nodes/clients/machines/devices, $n_m$ is the size of the $m$th local dataset, $x$ is the vector representation of a model using $d$ features, $y_j^m$ is the $j$th data point on the $m$th node, and $\ell$ is the loss function. We consider the most computationally powerful device, which can communicate with all others, to be the server. Without loss of generality we assume that $r_1$ corresponds to the server and the remaining $r_m$ to the other nodes.
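To make formulation (1) concrete, here is a minimal sketch of the finite-sum objective in Python; the quadratic loss and the synthetic local datasets are illustrative placeholders rather than the losses used later in the paper.

```python
import numpy as np

def local_risk(x, data, loss):
    """Empirical risk r_m(x): the average loss over one node's dataset."""
    return np.mean([loss(x, y) for y in data])

def global_risk(x, datasets, loss):
    """r(x) = (1/M) * sum_m r_m(x): the average of the local risks."""
    return np.mean([local_risk(x, data, loss) for data in datasets])

# Example: M = 4 nodes, each holding its own local sample.
rng = np.random.default_rng(0)
datasets = [rng.normal(size=(100, 3)) for _ in range(4)]
loss = lambda x, y: 0.5 * np.linalg.norm(x - y) ** 2  # placeholder quadratic loss
print(global_risk(np.zeros(3), datasets, loss))
```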
This formulation of the problem entails a number of challenges. In particular, transferring information between devices incurs communication overhead [36]. This can result in increased resource utilization, which affects the overall efficiency of the training process. There are different ways to deal with communication costs [29], [23], [3], [9]. One of the key approaches is exploiting the similarity of local data and hence the corresponding losses. One way to formally express it is to use Hessian similarity [39]:
$$
\|\nabla^2 r_m(x)-\nabla^2 r(x)\|\leqslant\delta\quad\text{for all } x\in\mathbb{R}^d \text{ and } m\in[1,M],
$$
where $\delta>0$ measures the degree of similarity between the corresponding Hessian matrices. The goal is to build methods with communication complexity depending on $\delta$, because it is known that $\delta\thicksim 1/\sqrt{n}$ for non-quadratic losses and $\delta\thicksim 1/n$ for quadratic ones [19]. Here $n$ is the size of the local training set. Hence the more data there are on the nodes, the more statistically ‘similar’ the losses are and the less communication is needed. The basic method in this case is $\texttt{Mirror Descent}$ with an appropriate Bregman divergence [35], [19]:
$$
x^{k+1}=\operatorname*{arg\,min}_{x\in\mathbb{R}^d}\Bigl\{\langle\nabla r(x^k),x\rangle+\frac{1}{\theta}\,D_{r_1}(x,x^k)\Bigr\},\qquad D_{r_1}(x,y):=r_1(x)-r_1(y)-\langle\nabla r_1(y),x-y\rangle.
$$
The computation of $\nabla r$ requires communication with the clients. Our goal is to reduce it. The key idea is that, by offloading the nodes, we move the most computationally demanding work to the server. Each node performs one computation per iteration (to collect $\nabla r$ on the server), while the server solves a local subproblem and accesses the local oracle $\nabla r_1$ much more frequently. While this approach provides improvements in communication complexity over a simple distributed version of $\texttt{Gradient Descent}$, there is room for improvement in various aspects. In particular, the problem of constructing an optimal algorithm remained open for a long time [35], [5], [45], [38]. It was recently solved by the introduction of $\texttt{Accelerated ExtraGradient}$ [24].
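A schematic sketch of one such iteration, assuming the server minimizes the Bregman model by plain gradient descent (the inner solver, its step size, and the iteration count are illustrative choices):

```python
import numpy as np

def server_step(x_k, g_k, grad_r1, theta, inner_steps=100, lr=0.01):
    """Approximately solve min_x <g_k, x> + D_{r_1}(x, x_k) / theta."""
    x = x_k.copy()
    grad_r1_k = grad_r1(x_k)  # the anchor gradient is evaluated once
    for _ in range(inner_steps):
        # Gradient of the model: g_k + (grad r_1(x) - grad r_1(x_k)) / theta.
        x -= lr * (g_k + (grad_r1(x) - grad_r1_k) / theta)
    return x
```

Only the averaged gradient `g_k` has to be communicated; every call to `grad_r1` is local to the server, which is exactly where the communication savings come from.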
Despite the fact that there is an algorithm with optimal communication complexity, open challenges remain: since each dataset is stored on its own node, it is expensive to compute the full gradient. Moreover, by transmitting it, we lose the ability to keep the procedure private [43], since in this case it is possible to recover local data. One of the best-known inexpensive ways to protect privacy is to add noise to the information communicated [1]. To summarize the above: in this paper we are interested in stochastic modifications of $\texttt{Accelerated ExtraGradient}$ [24] that are particularly resistant to the imposition of noise on the transmitted packets.
2. Related works
Since our goal is to investigate stochastic versions of algorithms that exploit data similarity (in particular, to speed up the computation of local gradients and to protect privacy), we give a brief literature review of related areas.
2.1. Distributed methods under similarity
In 2014 the scalable $\texttt{DANE}$ algorithm was introduced [35]. It was a novel Newton-type method, which used $\mathcal{O}\bigl((\delta/\mu)^2\log(1/\varepsilon)\bigr)$ communication rounds for $\mu$-strongly convex quadratic losses by taking a step appropriate to the geometry of the problem. This work can be considered as the first that started to apply Hessian similarity in the distributed setting. A year later $\varOmega\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ was found to be a lower bound for communication rounds under Hessian similarity [5]. Since then, extensive work has been done to achieve the outlined optimal complexity. The idea of $\texttt{DANE}$ was developed by adding momentum acceleration to the step, thereby reducing the complexity for quadratic loss to $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ [45]. Later, $\texttt{Accelerated SONATA}$ achieved $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log^2(1/\varepsilon)\bigr)$ for non-quadratic losses by the $\texttt{Catalyst}$ acceleration [39]. Finally, there is $\texttt{Accelerated ExtraGradient}$, which achieves $\mathcal{O}\bigl(\sqrt{\delta/\mu}\,\log(1/\varepsilon)\bigr)$ communication complexity by using ideas of Nesterov acceleration, sliding, and $\texttt{ExtraGradient}$ [24].
Other ways to introduce the definition of similarity are also found in the literature. In [22] the authors study the local technique in the distributed and federated optimization, and
$$
\sigma_*^2=\frac{1}{M}\sum_{m=1}^{M}\|\nabla r_m(x_*)\|^2
$$
is taken as a measure of gradient similarity. Here $x_*$ is the solution to problem (1). This kind of data heterogeneity is also examined in [44]. The case of gradient similarity of the form $\|\nabla r_m (x)-\nabla r(x)\| \leqslant \delta$ is discussed in [17]. Moreover, several papers use gradient similarity in the analysis of distributed saddle point problems [6], [10], [8], [7].
2.2. Privacy preserving
A brief overview of the issue of privacy in modern optimization is presented in [15]. In this paper, we are interested in differential privacy. In essence, differential privacy guarantees that results of a data analysis process remain largely unaffected by the addition or removal of any single individual’s data. This is achieved by introducing randomness or noise into the data in a controlled manner, so that the statistical properties of the data remain intact while protecting the privacy of individual samples [2]. Differential privacy involves randomized perturbations of the information sent between nodes and the server [32]. In this case it is important to design an algorithm that is resistant to transformations.
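As a sketch of how such controlled noise is typically injected, the following snippet implements the standard Gaussian mechanism: the vector is first clipped to bound its sensitivity and then perturbed with calibrated noise. Treating the whole transmitted vector as a single query, and the particular $(\varepsilon,\delta)$ calibration (valid for $\varepsilon<1$), are assumptions of this illustration.

```python
import numpy as np

def gaussian_mechanism(v, clip_norm, eps, delta, rng):
    """Clip v to l2-norm clip_norm, then add noise calibrated for (eps, delta)-DP."""
    v = v * min(1.0, clip_norm / np.linalg.norm(v))       # bound the l2 sensitivity
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return v + rng.normal(scale=sigma, size=v.shape)      # calibrated Gaussian noise
```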
2.3. Stochastic optimization
The simplest stochastic methods, for example, $\texttt{SGD}$, face a common problem: the stochastic gradient estimate does not tend to zero as the method approaches the optimum [16]. The solution to this issue appeared in 2014 with the invention of the variance reduction technique. There are two main approaches to its implementation: $\texttt{SAGA}$ [13] and $\texttt{SVRG}$ [20]. The first maintains a history of gradients, while the second introduces a reference point at which the full gradient is computed and updates this point infrequently. The methods listed above do not use acceleration. In recent years a number of accelerated variance reduction methods have emerged [31], [26], [25], [4].
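As a sketch, the $\texttt{SVRG}$-style estimator can be written as follows; the interface `grads` (a list of per-sample gradient callables) is an assumption of this illustration.

```python
import numpy as np

def svrg_estimator(x, snapshot, full_grad_at_snapshot, grads, rng):
    """Unbiased estimate: grad_i(x) - grad_i(snapshot) + full_grad(snapshot)."""
    i = rng.integers(len(grads))  # sample one component uniformly
    return grads[i](x) - grads[i](snapshot) + full_grad_at_snapshot
```

The correction term vanishes as both `x` and `snapshot` approach the optimum, which is precisely why the estimator's variance, unlike that of plain $\texttt{SGD}$, tends to zero.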
The idea of stochastic methods with variance reduction also carries over to the distributed setup. In particular, stochasticity helps to break through the lower bounds under the Hessian similarity condition. For example, there are methods that use variance reduction to implement client sampling. $\texttt{Catalyzed SVRP}$ enjoys $\mathcal{O}\bigl((1+M^{-1/4} \cdot\sqrt{\delta/\mu}\,)\log^2(1/\varepsilon)\bigr)$ complexity in terms of the amount of server-node communication [21]. This method combines stochastic proximal point evaluation, client sampling, variance reduction, and acceleration via the $\texttt{Catalyst}$ framework [28]. However, it requires strong convexity of each $r_m$. It is possible to move to weaker assumptions on the local functions: $\texttt{AccSVRS}$ achieves $\mathcal{O}\bigl((1+M^{-1/4}\cdot\sqrt{\delta/\mu}\,) \log(1/\varepsilon)\bigr)$ communication complexity [27]. In addition, variance reduction turns out to be useful for designing methods with compression: $\texttt{Three Pillars Algorithm}$ achieves $\mathcal{O}\bigl((1+M^{-1/2}\cdot(\delta/\mu))\log(1/\varepsilon)\bigr)$ for variational inequalities [11].
The weakness of the accelerated variance reduction methods is that they require a small step size inversely proportional to $\sqrt{M}$ [11], [21], [27]. Thus, one has to ‘pay’ for convergence to an arbitrary $\varepsilon$-solution by increasing the iteration complexity of the method. However, there is the gain, mentioned already, in terms of the number of server-node communications. Meanwhile, accelerated versions of the classic $\texttt{SGD}$ do not have this flaw [41]. In some regimes the accuracy that can be achieved without the use of variance reduction can be sufficient: these include distributed problems under similarity conditions.
3. Our contribution
$\bullet$ New method. We propose a new method based on the deterministic $\texttt{Accelerated ExtraGradient (AEG)}$ [24]. Unlike $\texttt{AEG}$, we add the ability to sample computational nodes, which saves a significant amount of runtime. As mentioned above, sampling was already implemented in $\texttt{Three Pillars Algorithm}$ [11], $\texttt{Catalyzed SVRP}$ [21], and $\texttt{AccSVRS}$ [27]. However, this was done using $\texttt{SVRG}$-like approaches. Instead, we build on $\texttt{SGD}$, which does not require decreasing the step size as the number of computational nodes increases.
$\bullet$ Privacy via noise. We consider stochasticity not only in the choice of nodes, but also in the additive noise in the transmission of information from the client to the server. This makes it much harder to steal private data in the case of an attack on the algorithm.
$\bullet$ Theoretical analysis. We show that Hessian similarity implies gradient similarity with an additional term. Therefore, it is possible to estimate the variance associated with sampling nodes more optimistically. Thus, in some cases $\texttt{ASEG}$ converges quite accurately before getting ‘stuck’ near the solution.
$\bullet$ Analysis of a subproblem. We consider different approaches to the solution of the subproblem that occurs on a server. Our analysis allows us to estimate the number of iterations required to ensure convergence of the method.
$\bullet$ Experiments. We validate the theory constructed by numerical experiments on several real datasets.
4. Stochastic version of $\texttt{Accelerated ExtraGradient}$
Algorithm 1 ($\texttt{ASEG}$) is a modification of $\texttt{AEG}$ [24]. In one iteration the server communicates with clients twice. In the first round random machines are selected (Line 6). They perform a computation and send the gradient to the server (Line 15). Then the server solves a local subproblem in Line 16 and the solution is used to compute the stochastic (extra)gradient at the second round of communication (Line 17 and Line 26). This is followed by a gradient step with momentum in Line 27.
Let us summarize briefly the main differences from $\texttt{AEG}$. Line 16 of Algorithm 1 uses a gradient over local batches of nodes, rather than over all of them, as done in the baseline method. Moreover, the case where local gradients are noisy is supported. This is needed for privacy purposes. Similarly in Line 27. Note that we consider the most general case where different devices are involved in solving the subproblem and performing the gradient step.
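The communication pattern just described can be sketched as follows. This mirrors only the structure of Algorithm 1 (client sampling, two rounds of exchange around the server subproblem, a momentum step); the exact coupling of the sequences and the coefficients $\tau$ and $\eta$ are fixed by the tuning in Lemma 1, and all names here are placeholders.

```python
import numpy as np

def aseg_iteration(x_f, x_g, clients, server, B, theta, tau, eta, rng):
    # Line 6: sample B clients uniformly without replacement.
    batch = rng.choice(len(clients), size=B, replace=False)

    # Round 1 (Line 15): sampled clients send (possibly noisy) local gradients.
    g1 = np.mean([clients[m].noisy_grad(x_g) for m in batch], axis=0)

    # Line 16: the server solves the local subproblem with its own oracle.
    x_half = server.solve_subproblem(x_g, g1, theta)

    # Round 2 (Lines 17 and 26): stochastic extragradient at the subproblem solution.
    g2 = np.mean([clients[m].noisy_grad(x_half) for m in batch], axis=0)

    # Line 27: gradient step with momentum (illustrative coupling).
    x_f = x_half
    x_g = x_f + tau * (x_g - x_f) - eta * g2
    return x_f, x_g
```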
$\texttt{ASEG}$ requires $\varTheta(2B)$ communication rounds (Line 15 and Line 26), where $B$ is the number of nodes involved in the iteration. Unlike our method, $\texttt{AEG}$ does $\varTheta(2M-2)$ rounds per iteration, and $\texttt{AccSVRS}$ does $\varTheta(2M)$ rounds on average. Thus, in experiments either the difference between the methods is invisible, or $\texttt{AEG}$ loses. $\texttt{ASEG}$ deals with the local gradient $\nabla r_m(x,\xi)$, which can have the structure
$$
\nabla r_m(x,\xi)=\nabla r_m(x)+\xi,
$$
where, for example, $\xi\sim N(0,\sigma^2)$ or $\xi\sim U(-c,c)$. This is the additive noise overlay technique, mentioned previously and popular in federated learning.
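A minimal sketch of this overlay, assuming the noise scales $\sigma$ and $c$ are dictated by the privacy requirements:

```python
import numpy as np

def noisy_grad(grad, sigma=0.1, c=None, rng=None):
    """Return grad + xi with xi ~ N(0, sigma^2 I), or xi ~ U(-c, c)^d if c is given."""
    if rng is None:
        rng = np.random.default_rng()
    if c is None:
        xi = rng.normal(scale=sigma, size=grad.shape)  # Gaussian overlay
    else:
        xi = rng.uniform(-c, c, size=grad.shape)       # uniform overlay
    return grad + xi
```

Both choices yield an unbiased oracle with bounded variance, which is consistent with Assumption 3 below.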
5. Theoretical analysis
Before we delve into the analysis, let us introduce some definitions and assumptions needed to build the convergence theory.
In this paper complexity is understood in the sense of the number of server-node communications. This means that we consider how many times the server exchanges vectors with clients. In this case, a single contact is counted as one communication.
We work with functions of a class widely used in the literature.
Definition 1. We say that a differentiable function $f\colon \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex on $\mathbb{R}^d$ if
$$
f(y)\geqslant f(x)+\langle\nabla f(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^2\quad\text{for all }x,y\in\mathbb{R}^d.
$$
We assume a strongly convex optimization problem (1) with possibly non-convex components.
Assumption 1. The function $r$ is $\mu$-strongly convex, and $r_1$ is convex and $L_1$-smooth.
This assumption does not require the convexity of the local functions: $\delta$-relatedness is a strong enough property that they are allowed to be non-convex. In this paper we are interested in the case where $\mu<\delta$, since $\mu\ll\delta\ll L$ for large datasets in practice [38].
Proposition 1. Consider the $\delta$-relatedness of $r_m(\,\cdot\,)$ to $r(\,\cdot\,)$ (Definition 3) for each $m\in[1,M]$. In this case we have
$$
\|\nabla r_m(x)-\nabla r(x)\|\leqslant\delta\|x-x_*\|+\|\nabla r_m(x_*)\|\quad\text{for all }x\in\mathbb{R}^d.
$$
Thus, from the similarity of the Hessians follows the similarity of the gradients with some correction: it suffices to integrate the difference of the Hessians along the segment from $x_*$ to $x$ and to use $\nabla r(x_*)=0$. Let us introduce a more general assumption on gradient similarity.
Assumption 2. For all $m\in[1,M]$ Definition 3 is satisfied and the following inequality holds:
$$
\|\nabla r_m(x)-\nabla r(x)\|\leqslant\zeta\,\delta\|x-x_*\|+\sigma_{\rm sim}\quad\text{for all }x\in\mathbb{R}^d.
$$
Assumption 2 is a generalization of Proposition 1 in which the correction term is multiplied by some constant $\zeta$. Formally, $\zeta$ can take any non-negative value, but it suffices to analyze only two cases corresponding to different degrees of similarity, $\zeta=0$ and $\zeta=1$: all positive values of $\zeta$ give asymptotically the same complexity.
A stochastic oracle is available for the function at each node. We impose the following standard assumption.
Assumption 3. For all $m\in\{1,\dots,M\}$ the stochastic oracle $\nabla r_m(x,\xi)$ is unbiased and has a bounded variance:
$$
\mathsf{E}_\xi\bigl[\nabla r_m(x,\xi)\bigr]=\nabla r_m(x),\qquad \mathsf{E}_\xi\bigl\|\nabla r_m(x,\xi)-\nabla r_m(x)\bigr\|^2\leqslant\sigma_{\rm noise}^2.
$$
Remark 1. The Hessian similarity of $r_m$ and $r$ for $m\ne1$ is used only to formulate Proposition 1 and is not needed anywhere else. The subsequent proofs require only that the server function reflects the averaged nature of the data, while the data at the nodes may be heterogeneous.
We want to obtain a criterion that guarantees the convergence of Algorithm 1. The obvious idea is to try to find one for the new algorithm by analogy with how it was done in the non-stochastic case [24].
Lemma 1. Consider Algorithm 1 for problem (1) under Assumptions 1, 2, and 3 with the following tuning:
See the proof in Appendix 1. First we deal with the case under the gradient similarity condition ($\zeta=0$), since it removes two additional constraints on the parameters and makes the analysis simpler.
The following theorem is obtained by rewriting the result of Lemma 1 for a finer tuning of the parameters. We take $\mu<\delta$ into account and find that
The theorem asserts convergence only to some neighbourhood of the solution. To guarantee convergence with an arbitrary accuracy, a suitable parameter $\theta$ must be chosen. We propose a corollary of Theorem 1.
Corollary 1. Consider the conditions of Theorem 1. For an appropriate tuning of $\theta$ Algorithm 1 requires
See the proof in Appendix 1. Note that the linear part of the estimate obtained reproduces the convergence of $\texttt{AEG}$. However, for $B=1$, for example, an iteration of $\texttt{ASEG}$ costs $\varTheta(2)$ communications instead of $\varTheta(2M-2)$. Thus, our method requires significantly less communication to reach the accuracy determined by the sublinear term. Moreover, as mentioned before, in the estimates for $\texttt{Catalyzed SVRP}$ and $\texttt{AccSVRS}$ an additional multiplier $M^{-1/4}$ arises due to the use of variance reduction. Comparing $M^{-1/4}\cdot\sqrt{\delta/\mu}$ with $BM^{-1}\cdot\sqrt{\delta/\mu}$, we notice the superiority of $\texttt{ASEG}$ in terms of the approach taken to measure communication complexity.
Note that if the gradients sent to the server are noiseless ($\sigma_{\rm noise}=0$) and the functions are absolutely similar in the gradient sense ($\sigma_{\rm sim}=0$), then $\texttt{ASEG}$ has an $M^{3/4}/B$ times lower complexity in terms of the number of communications.
In the case of a convex objective it is not possible to implement the method under Assumption 2 for $\zeta\ne0$, since the correction terms arising cannot be eliminated. Under Assumption 2 for $\zeta=0$ we propose a modification of $\texttt{ASEG}$ for convex objective. In this case we need
See the proof in Appendix 1. It can be seen from this proof that Assumption 2 for $\zeta=1$ leads to worse estimates. Nevertheless, the optimal communication complexity can be achieved if we choose a sufficiently large $B$ which equalizes two linear terms (at least $72\delta^{3/2}/\mu^{3/2}$). It comes from the equality
Thus, even in problems with a poor ratio of $\delta$ to $\mu$, if the number of available nodes is large enough, $\texttt{ASEG}$ outperforms $\texttt{AEG}$, $\texttt{Catalyzed SVRP}$, and $\texttt{AccSVRS}$.
6. Solving the subproblem

Thus, the subproblem is $L_A$-smooth with $L_A=1/\theta+L_1$ and $(1/\theta)$-strongly convex. In particular, its condition number equals $\theta L_A=1+\theta L_1$, which remains moderate under the choice $\theta\leqslant 1/(3\delta)$.
Despite the fact that problem (7) does not affect the number of communications, we would still like to spend less server time solving it. An obvious solution is to use stochastic approaches.
6.1. The $\texttt{SGD}$ approach
The basic approach is to use $\texttt{Stochastic Gradient Descent}$ ($\texttt{SGD}$) for the strongly convex smooth objective. In this case the convergence rate for the subproblem, measured in the squared argument norm, is given by the following theorem [37].
Theorem 3. Let Assumptions 1 and 3 hold for problem (7). Consider $T$ iterations of $\texttt{SGD}$ with step size $\gamma\leqslant 1/(2L_A)$. Then the following estimate holds:
Our interest is to choose step sizes in such a way as to avoid getting stuck in a neighbourhood of the solution of the subproblem. In accordance with [37], we note the following.
Theorem 4. Under Assumptions 1 and 3, consider the problem (7) to be solved using $\texttt{SGD}$. There exist decreasing step sizes $\gamma_t\leqslant 1/(2L_A)$ such that for $T$ iterations we obtain
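A sketch of such a schedule with an illustrative capped $O(1/t)$ decay (the exact step sizes in [37] differ in the constants); `stoch_grad` is an assumed callable returning an unbiased gradient estimate of the subproblem objective:

```python
import numpy as np

def sgd_subproblem(x0, stoch_grad, L_A, mu_A, T, rng):
    """SGD with gamma_t <= 1/(2 L_A), decaying as O(1/t) once the cap is passed."""
    x = x0.copy()
    for t in range(T):
        gamma_t = min(1.0 / (2.0 * L_A), 2.0 / (mu_A * (t + 1)))
        x -= gamma_t * stoch_grad(x, rng)
    return x
```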
In this way we can achieve the optimal convergence rate of $\texttt{SGD}$. However, we would like to enjoy linear convergence. The idea is to use variance reduction methods.
6.2. The $\texttt{SVRG}$ approach
Another approach to the solution of the subproblem is to use variance reduction methods, for example, $\texttt{Stochastic Variance Reduced Gradient}$ ($\texttt{SVRG}$). Let us take $x_g^k$ as a starting point and look at the first epoch. It is known from [16] that
where $J$ is the epoch size. The next point after the algorithm finishes the current epoch is $x=\dfrac{1}{J}\displaystyle\sum_{j=1}^Jx_j$. We can rewrite for the $T$th epoch:
Thus, for a suitable combination of the epoch size and step, we expect to obtain linear convergence in the subproblem, which is consistent with the experimental results.
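A sketch of this epoch structure, with the averaged iterate returned as the next reference point, as in the display above; `full_grad` and the list `grads` of per-sample gradient callables are assumed interfaces:

```python
import numpy as np

def svrg_epochs(x_ref, full_grad, grads, gamma, J, T, rng):
    for _ in range(T):                      # T epochs
        mu_ref = full_grad(x_ref)           # full gradient at the reference point
        x, iterates = x_ref.copy(), []
        for _ in range(J):                  # J inner steps per epoch
            i = rng.integers(len(grads))
            g = grads[i](x) - grads[i](x_ref) + mu_ref  # variance-reduced estimate
            x -= gamma * g
            iterates.append(x.copy())
        x_ref = np.mean(iterates, axis=0)   # x = (1/J) * sum_j x_j
    return x_ref
```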
6.3. The loopless approach
In this subsection we fill in some gaps in the analysis of the subproblem. The decision criterion for $\texttt{ASEG}$ is (4). Assume that we have obtained
We use this to obtain a lower bound on $T$. The important question is whether it is possible to take fewer steps while still meeting the criterion necessary for the convergence of the whole algorithm. In other words, we want to check whether the loopless version of $\texttt{ASEG}$ turns out to be better than the usual one, as happens in the case of $\texttt{SVRG}$ and $\texttt{Katyusha}$ [25].
We are interested in comparing the deterministic and random choices of $T$. Let $\xi$ be a random variable describing the number of steps; $\xi$ takes positive discrete values. Let $A$ be the number of steps for the subproblem of $\texttt{ASEG}$ defined deterministically and satisfying (4). Assume that
where $x_{\xi}$ is the point obtained in $\xi$ solver steps. Obviously, the converse is also true. In other words, condition (9) is a criterion for $\xi$ to solve the initial subproblem. At the same time, we want to minimize $\mathsf{E}\xi$ to take as few steps as possible.
Proposition 2. If $\mathsf{E}[\xi] < A$, then $\mathsf{E}\biggl[\dfrac{1}{\xi^{\alpha}}\biggr]>\dfrac{1}{A^{\alpha}}$ .
Proof. Let $p_i$ be the probability of choosing $\xi=i$. Then we have
$$
\mathsf{E}\biggl[\frac{1}{\xi^{\alpha}}\biggr]=\sum_{i}\frac{p_i}{i^{\alpha}}\geqslant\frac{1}{\bigl(\sum_{i}p_i\,i\bigr)^{\alpha}}=\frac{1}{(\mathsf{E}\xi)^{\alpha}}>\frac{1}{A^{\alpha}},
$$
where the first inequality is Jensen's inequality applied to the convex function $t\mapsto t^{-\alpha}$ (for $\alpha>0$), and the last one uses $\mathsf{E}\xi<A$. $\Box$
Thus, in the worst case, if the expectation of $\xi$ is less than the deterministic number of iterations, the criterion cannot be met and there is no probabilistic method of choosing the number of iterations that is better than the deterministic one.
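As a quick illustrative check of Proposition 2, one can draw $\xi$ from a geometric distribution with $\mathsf{E}\xi<A$ and compare the two sides numerically; the distribution and the parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A, alpha = 20, 1.0
xi = rng.geometric(p=0.1, size=10**6)             # E[xi] = 10 < A = 20
print(xi.mean())                                  # ~10
print(np.mean(1.0 / xi**alpha), 1.0 / A**alpha)   # the left value exceeds the right
```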
It is worth noting that the same can be said about the pair $\texttt{SVRG}$ and $\texttt{L-SVRG}$. Indeed, $\texttt{SVRG}$ reduces to $\texttt{L-SVRG}$ if the number of iterations performed before updating the reference gradient is treated as a random variable. Then the epoch size $J$ follows a geometric distribution, and one can compare which algorithm gives a more accurate solution after performing the inner loop. For the $\texttt{SVRG}$-rate we have [16]
7. Experiments

We conduct experiments on the $l_2$-regularized logistic regression problem with
$$
r_m(x)=\frac{1}{n_m}\sum_{j=1}^{n_m}\log\bigl(1+\exp(-b_j^m\langle a_j^m,x\rangle)\bigr)+\frac{\lambda}{2}\|x\|^2.
$$
Here $x$ refers to the parameters of the model, $a_j^m$ is the feature description, and $b_j^m$ is the corresponding label. To avoid overfitting, we use $l_2$-regularization. The penalty parameter $\lambda$ is set to $L/100$, where $L$ is the smoothness constant (see Definition 2). We use $\texttt{SVRG}$ and $\texttt{SARAH}$ as the subproblem solvers in Algorithm 1. In all tests the original datasets are merged into batches $\{r_b\}_{b=1}^{M+N}$. A function $r_1(x)$ is created from part of the acquired ones:
$$
r_1(x)=\frac{1}{N}\sum_{b=1}^{N} r_b(x),\qquad N < M.
$$
Due to Hessian similarity we have the following expression for the smoothness constant:
For logistic regression $\delta$ is estimated by enumerating 100 points in a neighbourhood of the solution. In both cases the value obtained is multiplied by $3/2$. In order to demonstrate correctly how the theory works in practice, we solve problem (1) for various parameters $\delta$, $\mu$, $M$, and $N$. We consider different tabular datasets from LibSVM [12]: $\texttt{Mushrooms}$ (binary classification, 112 features, 8 124 objects), $\texttt{W8A}$ (binary classification, 300 features, 49 749 objects), and $\texttt{A9A}$ (binary classification, 123 features, 32 561 objects). They allow for a wide range of different problem parameters and are most commonly used to illustrate the convergence of methods in the literature.
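A sketch of this setup for one of the LibSVM datasets; the file name, the label normalization, and the sizes $M$ and $N$ are illustrative, and the loader from scikit-learn is an assumed dependency.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

A, b = load_svmlight_file("mushrooms")     # features and labels in LibSVM format
A = A.toarray()
b = np.where(b == b.max(), 1.0, -1.0)      # normalize labels to {-1, +1}
n, d = A.shape

# Smoothness constant of the (unregularized) logistic loss: lambda_max(A^T A) / (4n).
L = np.linalg.eigvalsh(A.T @ A / (4 * n)).max()
lam = L / 100                              # l2 penalty, as in the text
L_total = L + lam                          # smoothness of the regularized objective

M, N = 50, 10                              # illustrative numbers of batches
batches = np.array_split(np.arange(n), M + N)
r1_indices = np.concatenate(batches[:N])   # data points forming r_1
```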
$\bullet$ We investigate how the size of the batches affects the convergence of $\texttt{ASEG}$ on problems with different parameters. Fig. 1 shows the dependence of the convergence of untuned $\texttt{ASEG}$ on the batch size. In the first experiment the method on 50 nodes solves the problem for $\mu=0.105$ and $\delta=1.45$. According to Corollary 2, one should choose $B>72(\delta/\mu)^{3/2} \approx 3\,695$ to obtain the optimal complexity; since only 50 nodes are available, the guaranteed rate is suboptimal for every feasible batch size. In the second experiment the method on 200 nodes solves the problem for $\mu=0.06$ and $\delta=0.07$. In this case $\texttt{ASEG}$ has the guaranteed optimal complexity for $B>57$ and converges to rather small values of the criterion.
$\bullet$ We compare $\texttt{ASEG}$ with $\texttt{AccSVRS}$ (Figs. 2 and 3); 50 nodes are available for $\texttt{Mushrooms}$ and 200 nodes for $\texttt{A9A}$. If the relation between $\mu$, $\delta$, and $B$ yields the optimal complexity according to Corollary 2, then, owing to similarity, the batching noise admits an optimistic estimate, which provides a significant gain over $\texttt{AccSVRS}$. Otherwise, it is worth using full batching and obtaining a less appreciable gain.
$\bullet$ In the $\texttt{ASEG}$ stability experiments (Figs. 4–7) we test how white noise imposed on each $\nabla r_m(x)$ affects convergence. It can be seen that increasing the randomness does not worsen convergence. This can be explained by the fact that convergence is mainly determined by the accuracy of the solver chosen.
$\bullet$ We also test how changing the epoch size affects the convergence of $\texttt{ASEG}$ itself and the convergence in the subproblem (Line 16 of Algorithm 1) for the $\texttt{SVRG}$ and $\texttt{SARAH}$ solvers (Figs. 8–15).
8. Conclusion

Based on the $\texttt{SGD}$-like approach, we have adapted $\texttt{Accelerated ExtraGradient}$ to federated learning problems by adding two sources of stochasticity: sampling of computational nodes and the application of noise to local gradients. Numerical experiments show that for small values of $\delta/\mu$ or a sufficiently large number of available nodes Algorithm 1 ($\texttt{ASEG}$) is significantly superior to accelerated variance reduction methods (Fig. 3), which is consistent with the theory (Corollary 1). If the problem parameters are not ‘good’ enough, a $[\delta^2/(\mu^2M)]\cdot\log(1/\varepsilon)$ term appears in the communication complexity of batched $\texttt{ASEG}$ (Corollary 2). Nevertheless, even in this case the superiority in the number of communications remains (Figs. 2 and 3). Experiments also show the robustness of our method against noise (Figs. 4–7). The server subproblem has also been studied numerically (Figs. 10, 11, 14, and 15) and theoretically. The analysis shows the inefficiency of a random choice of the number of solver iterations for the internal subproblem (Proposition 2).
Appendix 1. Proofs

Throughout the proofs, expectations are understood according to the context.
In this section we prove Lemma 1. First, we need the following assertion.
Lemma 2. Consider Algorithm 1. Let $\theta$ be defined as in Lemma 1: $\theta \leqslant 1/(3\delta)$. Then under Assumptions 1 and 3 the following inequality holds for all $\overline{x} \in \mathbb{R}^d$:
Let us use the tower property and take the $\xi$-expectation first. Since $m_i$ and $m_j$ are independent, we can take the expectations of the factors in the second summand separately. Then only the first summand remains; estimating it by means of Assumption 2, we obtain
As in the case of Corollary 1, an iteration of the method requires $B/M$ communications. $\Box$
Appendix D. $\texttt{ASEG}$ for the convex case
The next algorithm is an adaptation of Algorithm 1 to the convex case. We use time-varying $\tau_k$ and $\eta_k$.
Theorem 6. Let Algorithm 2 run for $N$ iterations under Assumption 1 for $\mu= 0$, Assumption 2 for $\zeta=0$, and Assumption 3, with the following tuning:
Similarly to the case where $\mu>0$, it is not possible to ensure convergence by a naive choice of $\theta$. We propose a choice of $\theta$ that ensures optimal convergence.
Corollary 4. Consider Assumption 1 for $\mu=0$, Assumption 2 for $\zeta=0$, and Assumption 3. Let Algorithm 2 run for $N$ iterations with the tuning of $\tau_k$ and $\eta_k$ as in Theorem 6. Then the following tuning of $\theta$ is proposed.
Notice that $\theta= \dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}}$ equalizes the terms on the right-hand side. If $\dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}} \leqslant \dfrac{1}{3\delta}$ , we choose $\theta= \dfrac{\sqrt{B}\,\|x_0-x_*\|}{\sqrt{3}\,\sigma(N+1)^{3/2}}$ . Then we obtain
Bibliography

1. M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and Li Zhang, “Deep learning with differential privacy”, Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (Vienna 2016), ACM, New York, 2016, 308–318
2. M. Aitsam, “Differential privacy made easy”, 2022 International conference on emerging trends in electrical, control, and telecommunication engineering (ETECTE) (Lahore 2022), IEEE, 2022, 1–7
3. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS'17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718, https://proceedings.neurips.cc/paper/2017/hash/6c340f25839e6acdc73414517203f5f0-Abstract.html
4. Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods”, J. Mach. Learn. Res., 18 (2018), 221, 51 pp.
6. B. Barazandeh, Tianjian Huang, and G. Michailidis, “A decentralized adaptive momentum method for solving a class of min-max optimization problems”, Signal Process., 189 (2021), 108245, 23 pp.
7. A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133, https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html
8. A. Beznosikov and A. Gasnikov, “Compression and data similarity: combination of two techniques for communication-efficient solving of distributed variational inequalities”, Optimization and applications, Lecture Notes in Comput. Sci., 13781, Springer, Cham, 2022, 151–162
9. A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning”, J. Mach. Learn. Res., 24 (2023), 276, 50 pp.
10. A. Beznosikov, G. Scutari, A. Rogozin, and A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184, https://proceedings.neurips.cc/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html
11. A. Beznosikov, M. Takác, and A. Gasnikov, “Similarity, compression and local steps: three pillars of efficient communications for distributed variational inequalities”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1246, 28663–28677, https://proceedings.neurips.cc/paper_files/paper/2023/hash/5b4a459db23e6db9be2a128380953d96-Abstract-Conference.html
12. Chih-Chung Chang and Chih-Jen Lin, “LIBSVM: A library for support vector machines”, ACM Trans. Intell. Syst. Technol. (TIST), 2:3 (2011), 27, 27 pp.
13. A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”, NIPS'14: Proceedings of the 27th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 27, MIT Press, Cambridge, MA, 2014, 1646–1654, https://proceedings.neurips.cc/paper_files/paper/2014/hash/ede7e2b6d13a41ddf9f4bdef84fdc737-Abstract.html
15. S. Gade and N. H. Vaidya, “Private optimization on networks”, 2018 Annual American control conference (ACC) (Milwaukee, WI 2018), IEEE, 2018, 1402–1409
16. E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 680–690, https://proceedings.mlr.press/v108/gorbunov20a.html; 2019, 38 pp., arXiv: 1905.11261
17. E. Gorbunov, F. Hanzely, and P. Richtárik, “Local SGD: unified theory and new efficient methods”, Proceedings of the 24th international conference on artificial intelligence and statistics (AISTATS 2021), Proc. Mach. Learn. Res. (PMLR), 130, 2021, 3556–3564, https://proceedings.mlr.press/v130/gorbunov21a.html
18. T. M. Hehn, J. F. Kooij, and F. A. Hamprecht, “End-to-end learning of decision trees and forests”, Int. J. Comput. Vis., 128:4 (2020), 997–1011
19. H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, and L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227, https://proceedings.mlr.press/v119/hendrikx20a.html
21. A. Khaled and Chi Jin, Faster federated optimization under second-order similarity, 2023 (v1 – 2022), 44 pp., arXiv: 2209.02257
22. A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529, https://proceedings.mlr.press/v108/bayoumi20a.html
23. A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393, https://proceedings.mlr.press/v119/koloskova20a.html
24. D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, and G. Scutari, “Optimal gradient sliding and its application to optimal distributed optimization under similarity”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2427, 33494–33507, https://proceedings.neurips.cc/paper_files/paper/2022/hash/d88f6f81e1aaf606776ffdd06fdf24ef-Abstract-Conference.html
25. D. Kovalev, S. Horváth, and P. Richtárik, “Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop”, Proceedings of the 31st international conference on algorithmic learning theory (ALT 2020), Proc. Mach. Learn. Res. (PMLR), 117, 2020, 451–467, https://proceedings.mlr.press/v117/kovalev20a.html
26. Zhize Li, Hongyan Bao, Xiangliang Zhang, and P. Richtárik, “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization”, Proceedings of the 38th international conference on machine learning (ICML 2021), Proc. Mach. Learn. Res. (PMLR), 139, 2021, 6286–6295, https://proceedings.mlr.press/v139/li21a.html
29. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics (AISTATS 2017), Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282, https://proceedings.mlr.press/v54/mcmahan17a
30. J. M. Moguerza and A. Muñoz, “Support vector machines with applications”, Statist. Sci., 21:3 (2006), 322–336
31. L. M. Nguyen, Jie Liu, K. Scheinberg, and M. Takáč, “SARAH: A novel method for machine learning problems using stochastic recursive gradient”, Proceedings of the 34th international conference on machine learning (ICML 2017), Proc. Mach. Learn. Res. (PMLR), 70, 2017, 2613–2621, https://proceedings.mlr.press/v70/nguyen17b.html
32. E. Nozari, P. Tallapragada, and J. Cortés, “Differentially private distributed convex optimization via functional perturbation”, IEEE Trans. Control Netw. Syst., 5:1 (2018), 395–408
34. S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence”, J. Mach. Learn. Res., 11 (2010), 2635–2670
35. O. Shamir, N. Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning (ICML 2014), Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008, https://proceedings.mlr.press/v32/shamir14.html
36. S. Sreenivasan, R. Cohen, E. López, Z. Toroczkai, and H. E. Stanley, Communication bottlenecks in scale-free networks, 2006, 5 pp., arXiv: cs/0604023
37. S. U. Stich, Unified optimal analysis of the (stochastic) gradient method, 2019, 11 pp., arXiv: 1907.04232
38. Ying Sun, G. Scutari, and A. Daneshmand, “Distributed optimization based on gradient tracking revisited: enhancing convergence rate via surrogation”, SIAM J. Optim., 32:2 (2022), 354–385
39. Ye Tian, G. Scutari, Tianyu Cao, and A. Gasnikov, “Acceleration in distributed optimization under similarity”, Proceedings of the 25th international conference on artificial intelligence and statistics (AISTATS 2022), Proc. Mach. Learn. Res. (PMLR), 151, 2022, 5721–5756, https://proceedings.mlr.press/v151/tian22b.html
40. G. Tripepi, K. J. Jager, F. W. Dekker, and C. Zoccali, “Linear and logistic regression analysis”, Kidney Int., 73:7 (2008), 806–810
41. S. Vaswani, I. Laradji, F. Kunstner, Si Yi Meng, M. Schmidt, and S. Lacoste-Julien, Adaptive gradient methods converge faster with over-parameterization (but you should do a line-search), 2021 (v1 – 2020), 41 pp., arXiv: 2006.06835
42. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp.
43. P. C. Weeraddana and C. Fischione, “On the privacy of optimization”, IFAC-PapersOnLine, 50:1 (2017), 9502–9508
45. Xiao-Tong Yuan and Ping Li, “On convergence of distributed approximate Newton methods: globalization, sharper bounds and beyond”, J. Mach. Learn. Res., 21 (2020), 206, 51 pp.
Citation:
D. A. Bylinkin, K. D. Degtyarev, A. N. Beznosikov, “Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning”, Russian Math. Surveys, 79:6 (2024), 939–973