Russian Mathematical Surveys, 2024, Volume 79, Issue 6, Pages 1017–1049
DOI: https://doi.org/10.4213/rm10207e
(Mi rm10207)
 

Local SGD for near-quadratic problems: Improving convergence under unconstrained noise conditions

A. E. Sadchikov$^{a}$, S. A. Chezhegov$^{a,b}$, A. N. Beznosikov$^{b,c,d}$, A. V. Gasnikov$^{a,b,d}$

a Moscow Institute of Physics and Technology (National Research University), Moscow, Russia
b Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia
c Sber AI Lab, Moscow, Russia
d Innopolis University, Innopolis, Russia
Abstract: Distributed optimization plays an important role in modern large-scale machine learning and data processing systems by optimizing the utilization of computational resources. One of the classical and popular approaches is Local Stochastic Gradient Descent (Local SGD), characterized by multiple local updates before averaging, which is particularly useful in distributed environments for reducing communication bottlenecks and improving scalability. A typical feature of this method is the dependence of its convergence on the frequency of communications. However, in the case of a quadratic target function with homogeneous data distribution over all devices, the influence of the communication frequency vanishes. As a natural consequence, subsequent studies adopted the assumption of a Lipschitz Hessian, since this condition measures, to a certain extent, the similarity of the optimized function to a quadratic one. However, in order to extend the Local SGD theory and unlock its potential, in this paper we abandon the Lipschitz Hessian assumption by introducing a new concept of approximate quadraticity. This assumption gives a new perspective on problems with near-quadratic properties. In addition, existing theoretical analyses of Local SGD often assume a bounded variance. We, in turn, consider the unbounded noise condition, which allows us to broaden the class of problems under study.
Bibliography: 36 titles.
Keywords: distributed optimization, quadraticity, strong growth condition.
Funding agency
The work of A. V. Gasnikov was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and an agreement with the Ivannikov Institute for System Programming of the Russian Academy of Sciences, dated November 2, 2021, no. 70-2021-00142.
Received: 16.08.2024
Published: 20.02.2025
Bibliographic databases:
Document Type: Article
UDC: 519.853.62
MSC: 90C15
Language: English
Original paper language: English

1. Introduction

Stochastic Gradient Descent (SGD) [25] is a classical and widely used algorithm for machine learning [27] and deep learning [10] tasks, owing to its easy interpretability and low computational cost. However, this approach has bottlenecks, such as sensitivity to noise, because the stochastic gradient can be computed with large variance. One solution to this problem is the parallelization of gradient computations over local data, since collecting a larger set of stochastic realizations of the gradient leads to variance reduction. Such a method is known in the literature as Parallel SGD [21]. Nowadays, with the rapid development of artificial intelligence, tasks require substantial computational resources, which makes Parallel SGD advantageous over classical SGD. Moreover, in such a branch of machine learning as federated learning [17], [13], [14], where data privacy [1] plays an important role, collecting all the information on a single device becomes essentially impossible. As a consequence, parallelization of the optimization process is mandatory in these settings because of the very nature of the problem.

Despite the advantages of Parallel SGD, it has a bottleneck of its own, namely communication [17], [32]. Indeed, in practice the communication process can be expensive compared to the cost of computing local stochastic gradients. As a consequence, the Local SGD method was proposed in [31] and [22]. While Parallel SGD is based on computing local stochastic gradients at a single common point, Local SGD allows local steps, producing a separate sequence of points for each computing device.

Local SGD has been widely studied in the standard form [31], [15], for variational inequalities [4], [6], in the decentralized paradigm [3], [16], and with modifications such as momentum [33], quantization [2], [24], variance reduction [20], [29], and preconditioning [7]. In addition, new interpretations of the local technique have emerged, resulting in methods known in the literature as FedProx [19], SCAFFOLD [14], and ProxSkip [23]. Moreover, papers on the Hessian similarity of local functions [28], [12], [5], [18], whose essence is local computation on one particular device, are worth mentioning.

As said before, some interpretations of local approaches are based on the similarity of data distributions across devices. Typically, two main cases of data distributions are distinguished, a homogeneous (or identical) one and a heterogeneous one, which describe whether the data distributions across devices are similar or different, respectively. This plays a huge role in the analysis of Local SGD, since knowledge about the similarity of the data allows one to obtain better convergence rates or to exploit certain problem structures. Thus, for example, in [36] it was observed that for quadratic objective functions with homogeneous data distribution the convergence rate of Local SGD is independent of the communication frequency. Later, a proof of this fact was provided in [34]. This important observation aligns with the lower bounds attained by the Local AC-SA algorithm [34], [8], indicating that Local AC-SA is minimax optimal for quadratic objectives.

A natural development in the study of this phenomenon is the use of the assumption of a Lipschitz Hessian, since it is this condition that specifies the nature of approximate quadraticity of the objective function. In this direction the papers [35] and [9] are worth noting. Nevertheless, the assumption of a Lipschitz Hessian is quite strong and severely limits the class of problems under consideration. Therefore, our efforts are focused on finding a weaker requirement and analyzing Local SGD from the perspective of the new assumption.

It is also worth mentioning that the majority of the existing research, with the exception of [30], [16], and [11], operates under the assumption of uniformly bounded variance of the stochastic gradients. Nevertheless, this assumption is unrealistic in modern scenarios. For instance, it can easily be shown that it fails even in the case of quadratic loss functions.

Therefore, another aim of our study is to analyze Local SGD for quadratic-like functions under other conditions on the stochasticity. In particular, we consider noise with unbounded variance, where the variance of the stochastic gradient may grow with the squared norm of the gradient. In the literature this assumption is called the strong growth condition; it was originally investigated under this name in [26].
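
As a hedged numerical illustration of the two points above (that uniformly bounded variance fails even for quadratic losses, while a bound of the form $\sigma^2+\rho\|\nabla F(x)\|^2$ remains plausible), the following sketch estimates the stochastic-gradient variance of a synthetic least-squares problem at points increasingly far from the optimum. The data, dimensions, and sampling scheme are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic least-squares problem: F(x) = E_z [0.5*(a_z^T x - b_z)^2],
# where z indexes the rows of (A, b).  All data below are illustrative.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

full_grad = lambda x: A.T @ (A @ x - b) / n          # nabla F(x)
stoch_grads = lambda x: A * (A @ x - b)[:, None]     # one stochastic gradient per sample z

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
for scale in [0, 1, 10, 100]:
    x = x_star + scale * rng.normal(size=d)          # move away from the optimum
    g = full_grad(x)
    variance = np.mean(np.sum((stoch_grads(x) - g) ** 2, axis=1))
    print(f"scale={scale:4d}  var={variance:14.2f}  ||grad||^2={np.sum(g**2):14.2f}")
# The empirical variance grows together with ||nabla F(x)||^2, so no uniform bound
# sigma^2 exists, while a strong-growth bound sigma^2 + rho*||nabla F(x)||^2 can still hold.
```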

To sum up, our contributions can be formulated as follows:

Table 1. Comparison of works on Local SGD

Reference | NLH | UG | Noise model | Convexity | Convergence rate
Stich [31] | $\checkmark$ | $\times$ | Uniform | $\mu=0$ | —
 | | | | $\mu>0$ | $\mathcal{O}\biggl(\dfrac{D^2}{K^3}+\dfrac{\sigma^2}{\mu M T}+\dfrac{\kappa G^2}{\mu K^2}\biggr)$
Yuan–Ma [35] | $\times$ | $\checkmark$ | Uniform | $\mu=0$ | $\widetilde{\mathcal{O}}\biggl(\dfrac{LD^2}{T}+\dfrac{\sigma D}{\sqrt{MT}}+\dfrac{Q^{1/3}\sigma^{2/3}D^{5/3}}{T^{1/3} K^{1/3}}\biggr)$
 | | | | $\mu>0$ | $\widetilde{\mathcal{O}}\biggl(\text{exp. decay}+\dfrac{\sigma^2}{\mu M T}+\dfrac{Q^2\sigma^4}{\mu^5 T^2 K^2}\biggr)$
Khaled et al. [15] | $\checkmark$ | $\checkmark$ | Uniform | $\mu=0$ | $\mathcal{O}\biggl(\dfrac{D^2}{\sqrt{MT}}+\dfrac{\sigma^2}{L\sqrt{MT}}+\dfrac{\sigma^2 M}{LK}\biggr)$
 | | | | $\mu>0$ | $\widetilde{\mathcal{O}}\biggl(\dfrac{L D^2}{T^2}+\dfrac{L\sigma^2}{\mu^2 M T}+\dfrac{L^2\sigma^2}{\mu^3 T K}\biggr)$
Woodworth et al. [34] | $\checkmark$ | $\checkmark$ | Uniform | $\mu=0$ | $\mathcal{O}\biggl(\dfrac{L D^2}{T}+\dfrac{\sigma D}{\sqrt{MT}}+\biggl(\dfrac{L\sigma^2 D^4}{TK}\biggr)^{1/3}\biggr)$
 | | | | $\mu>0$ | $\widetilde{\mathcal{O}}\biggl(\text{exp. decay}+\dfrac{\sigma^2}{\mu TM}+\dfrac{L \sigma^2}{\mu^2 TK}\biggr)$
Spiridonoff et al. [30] | $\checkmark$ | $\checkmark$ | Strong growth | $\mu=0$ | —
 | | | | $\mu>0$ | $\mathcal{O}\biggl(\dfrac{L^2 D^2}{\mu^2 T^2}+\dfrac{\rho L^4\log(TK^{-2}) D^2}{\mu^4 T^2}+\dfrac{L \sigma^2}{\mu^2 M T}+\dfrac{L^2 \sigma^2}{\mu^3 T K}\biggr)$
This work | $\checkmark$ | $\checkmark$ | Uniform | $\mu=0$ | $\mathcal{O}\biggl(\dfrac{L D^2}{T}+\dfrac{\sigma D}{\sqrt{MT}}+\biggl(\dfrac{\boldsymbol{\varepsilon}L\sigma^2 D^4}{TK}\biggr)^{1/3}\biggr)$
 | | | | $\mu>0$ | $\widetilde{\mathcal{O}}\biggl(\text{exp. decay}+\dfrac{\sigma^2}{\mu TM}+\dfrac{\boldsymbol{\varepsilon} L\sigma^2}{\mu^2 TK}\biggr)$
This work | $\checkmark$ | $\checkmark$ | Strong growth | $\mu=0$ | —
 | | | | $\mu>0$ | $\widetilde{\mathcal{O}}\biggl(\text{exp. decay}+\dfrac{\sigma^2}{\mu M T}+\dfrac{\rho L^2 \sigma^2}{\mu^3 M T^2 K}+\dfrac{\boldsymbol{\varepsilon}L\sigma^2}{\mu^2 TK}\biggr)$

The general notation: $\mathcal{O}$ omits constant factors; $\widetilde{\mathcal{O}}$ omits polylogarithmic and constant factors. In Table 1, NLH means ‘non-Lipschitz Hessian’ and UG means ‘unbounded gradient’; $D=\|x_0-x_*\|$ is the initial distance to the minimum; $\sigma^2$ is the variance of the stochastic gradient; $\mu$ is the strong convexity constant; $L$ is the Lipschitz gradient constant; $M$ is the number of workers; $K$ is the number of communication rounds; $T$ is the total number of iterations; $\kappa=L/\mu$ is the condition number. For a more detailed explanation, see Table 2.

In [31] $G$ denotes a uniform bound for the gradient norm, that is, $\|\nabla F(x)\| \leqslant G$ for all $x$.

In [35] $Q$ denotes the Lipschitz constant of the Hessian.

In [30], as well as in our work, $\rho$ denotes the strong growth parameter (for more details, see Assumption 2 in § 4).

In our work $\varepsilon$ shows how far the function is from a quadratic one (for all functions we have $\varepsilon \leqslant 1$ and for quadratic functions we have $\varepsilon=0$); $\mu=\mu_Q+\mu_R$, where $Q$ and $R$ are convex functions such that $Q+R=F$ (for more details, see Assumption 1 in § 4).

2. Related work

One of the first results on Local SGD was [31], where convergence in the strongly convex case was analyzed under the condition of bounded gradients, which appears to be quite restrictive. That research was continued in [15] and [34], which removed the assumption of bounded gradient norms and provided the best convergence rates known under the uniformly bounded noise model for the convex and $L$-smooth objectives under consideration. Moreover, the rates in [15] and [34] partially match the lower bounds of [9] for both the identical and heterogeneous cases under different convexity assumptions. The work [30] generalizes the previous ones by replacing the assumption of uniformly bounded variance by the strong growth condition, which captures a wider range of problems.

Regarding the Lipschitz Hessian assumption, one of the main papers is [35], where the study is carried out in the convex and strongly convex cases. It is worth noting that the data distribution across devices needs to be assumed identical in order to reap the benefits of optimizing quadratic objective functions. We provide these bounds in Table 1.

Our work achieves similar results in terms of the bounds obtained in [34] and outperforms the upper bounds in [30]. For more details, see Table 1.

3. Problem formulation

Let us formally introduce the problem we solve. Consider a scenario where $M$ devices $1,2,\dots,M$ collectively solve an optimization problem, aiming to find

$$ \begin{equation} x_*:=\arg\,\min_{x \in \mathbb{R}^d} \biggl(F(x)=\dfrac{1}{M}\sum_{m=1}^{M} F_m(x)\biggr), \end{equation} \tag{1} $$
where
$$ \begin{equation*} F_m(x)=\mathsf{E}_{z_m\sim\mathcal{D}_m}[F_m(x,z_m)] \end{equation*} \notag $$
is the loss function for the $m$th client. Here $\mathcal{D}_m$ denotes the data distribution for the $m$th client and $x$ denotes the parameters of the model.

Since the consideration of approximate quadraticity requires a homogeneous distribution of data across devices, $\mathcal{D}_m$ can be replaced by $\mathcal{D}$, which is some distribution of data characterizing each client. As a consequence, we obtain

$$ \begin{equation} F_1(x)=\cdots=F_M(x)=F(x), \end{equation} \tag{2} $$
where $F(x)$ is defined by
$$ \begin{equation*} F(x):=\mathsf{E}_{z \sim \mathcal{D}}[F(x,z)]. \end{equation*} \notag $$

Now we are ready to get acquainted with the formal definition of Local SGD.

The algorithm is constructed as follows. Let there be $M$ clients communicating only with the server. At each iteration the $m$th client computes the local stochastic gradient $\boldsymbol{g}_t^m$ and makes an SGD step (Line 10 of Algorithm 1). At a communication moment (see Lines 7–8), each node forwards its current point to the server, which averages these points and sends the result back to each client. The number of local steps is denoted by $H$ and the number of communication rounds by $K$. Then the total number of iterations of the algorithm is $T=KH$.
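
Since the listing of Algorithm 1 is not reproduced above, here is a minimal Python sketch of the procedure just described. The gradient oracle, stepsize, and toy data are our own illustrative assumptions; only the local-step/averaging structure corresponds to the description.

```python
import numpy as np

def local_sgd(grad_oracle, x0, gamma, M, K, H, rng):
    """Minimal sketch of Local SGD: K communication rounds, H local steps each.

    grad_oracle(x, rng) returns one stochastic gradient at x.
    The total number of SGD iterations is T = K * H.
    """
    x = [x0.copy() for _ in range(M)]          # one local copy per device
    for _ in range(K):                         # communication rounds
        for m in range(M):                     # in practice these loops run in parallel
            for _ in range(H):                 # local SGD steps
                x[m] -= gamma * grad_oracle(x[m], rng)
        x_avg = np.mean(x, axis=0)             # the server averages the local points
        x = [x_avg.copy() for _ in range(M)]   # and broadcasts the result back
    return x_avg

# Toy usage: quadratic objective F(x) = 0.5*||x||^2 with additive Gaussian noise.
rng = np.random.default_rng(0)
oracle = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)
x_out = local_sgd(oracle, x0=np.ones(5), gamma=0.1, M=4, K=50, H=10, rng=rng)
print(np.linalg.norm(x_out))
```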

It is important to highlight how we measure complexity in distributed optimization. Unlike classical optimization, where complexity is understood as the number of oracle calls, that is, computations of the local gradients $\nabla F(x^m, z^m)$, here we also consider communication complexity. Generally, we define it as the quantity of information transmitted or, in our case, the number of communication rounds. Thus, in this work we focus on reducing the number of communication rounds $K$, or, equivalently, increasing the number of SGD steps $H$ between exchanges of information, while keeping the total number of SGD steps $T$ unchanged.

4. Settings and notation

In this section we introduce the set of assumptions under which we explore Local SGD.

Definition 1 (strong convexity and smoothness). The function $f$ is $\mu$-strongly convex and $L$-smooth if for all $x, y \in \mathbb{R}^d$,

$$ \begin{equation*} \dfrac{\mu}{2}\|x-y\|^2 \leqslant f(y)-f(x)-\langle\nabla f(x),y-x\rangle \leqslant \dfrac{L}{2}\|x-y\|^2. \end{equation*} \notag $$

Corollary 1. Suppose that the function $f$ is $0$-strongly convex and $L$-smooth. Then

$$ \begin{equation*} \dfrac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2 \leqslant f(y)-f(x)-\langle\nabla f(x),y-x\rangle. \end{equation*} \notag $$

After the above definition we introduce an assumption about approximate quadraticity.

Assumption 1 ($\varepsilon$-decomposition). The objective function can be decomposed as $F=Q+R$, where $Q$ is a quadratic $\mu_Q$-strongly convex and $L_Q$-smooth function, $R$ is a $\mu_R$-strongly convex and $L_R$-smooth function, and $F$ is an $L$-smooth function.

Under Assumption 1 we set $\varepsilon:=L_R/L$. We also denote the sum of $\mu_Q$ and $\mu_R$ by $\mu$, and $L/\mu$ by $\kappa$.
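
As a simple illustration of Assumption 1 (our own example, not taken from the paper), consider the one-dimensional regularized logistic loss
$$ \begin{equation*} F(x)=\log(1+e^{-x})+\dfrac{\alpha}{2}\,x^2, \qquad Q(x)=\dfrac{\alpha}{2}\,x^2, \quad R(x)=\log(1+e^{-x}), \qquad \alpha>0. \end{equation*} \notag $$
Here $Q$ is quadratic with $\mu_Q=L_Q=\alpha$, $R$ is convex ($\mu_R=0$) and $L_R$-smooth with $L_R=1/4$, and $F$ is $L$-smooth with $L=\alpha+1/4$, so that $\varepsilon=L_R/L=1/(4\alpha+1)$: the heavier the quadratic regularization, the closer $\varepsilon$ is to zero, while $\varepsilon=0$ corresponds to a purely quadratic objective.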

Corollary 2. Suppose that Assumption 1 holds. Then the following takes place:

In order to show theoretical convergence guarantees, we introduce an assumption on the stochastic oracle: unbiasedness and a strong growth condition.

Assumption 2 (stochastic oracle). There exist constants $\sigma$ and $\rho$ such that

$$ \begin{equation*} \mathsf{E}_{z \sim \mathcal{D}}[\nabla F(x,z)]=\nabla F(x) \quad (\text{unbiased oracle}) \end{equation*} \notag $$
and
$$ \begin{equation*} \mathsf{E}_{z \sim \mathcal{D}}\|\nabla F(x,z)-\nabla F(x)\|^2 \leqslant \sigma^2+\rho\|\nabla F(x)\|^2 \quad (\text{strong growth condition}). \end{equation*} \notag $$

5. Convergence analysis

In this section we present results on the convergence of Algorithm 1 in different settings. We start with the investigation of the strongly convex case under the bounded variance assumption.

Theorem 1 (bounded variance for $\mu > 0$). Suppose that Assumptions 1 and 2 hold for $\mu > 0$ and $\rho=0$. Then after $T=KH$ iterations of Algorithm 1, if $T \leqslant 2\kappa$, we choose $\gamma_t=1/(6L)$ and have

$$ \begin{equation*} \mathsf{E}[F(\widetilde{x}_T)-F(x_*)]= \mathcal{O}\biggl(L\exp\biggl(\dfrac{-\mu T}{L}\biggr)\|x_{0}-x_*\|^2+ \dfrac{\sigma^2}{\mu TM}+\dfrac{\varepsilon L\sigma^2}{\mu^2 TK}\biggr). \end{equation*} \notag $$
On the other hand, for $T > 2\kappa$ we choose the sequence $\{\gamma_t\}^T_{t=0}$ in the following way:
$$ \begin{equation*} \gamma_t=\begin{cases} \dfrac{1}{6L} & \textit{if}\ \ t \leqslant \dfrac{T}{2}\vphantom{\biggl)}\,, \\ \dfrac{2}{\mu(\xi+t)} & \textit{if}\ \ t > \dfrac{T}{2}\,, \end{cases} \end{equation*} \notag $$
where $\xi=12 L/\mu-T/2$; for this choice of steps we have
$$ \begin{equation*} \mathsf{E}[F(\widetilde{x}_T)-F(x_*)]=\widetilde{\mathcal{O}} \biggl(L\exp\biggl(\dfrac{-\mu T}{L}\biggr)\|x_{0}-x_*\|^2+ \dfrac{\sigma^2}{\mu TM}+\dfrac{\varepsilon L \sigma^2}{\mu^2 TK}\biggr). \end{equation*} \notag $$
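
The two-phase stepsize schedule used in the case $T>2\kappa$ can be written down directly. The snippet below is a plain transcription of the formulas above; the function name and the sample parameter values are our own illustrative assumptions.

```python
def theorem1_stepsize(t, T, L, mu):
    """Stepsize gamma_t from Theorem 1 for T > 2*kappa: constant 1/(6L) during the
    first half of the iterations, then a decreasing O(1/t) schedule with xi = 12L/mu - T/2."""
    xi = 12 * L / mu - T / 2
    if t <= T / 2:
        return 1 / (6 * L)
    return 2 / (mu * (xi + t))

# Example: with L = 1, mu = 0.01 (kappa = 100) and T = 1000 the schedule switches
# continuously from 1/6 to the decaying phase after t = 500.
print([round(theorem1_stepsize(t, T=1000, L=1.0, mu=0.01), 4) for t in (0, 500, 501, 999)])
```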

Remark 1. Hereafter the point $\widetilde{x}_T$ denotes a weighted average of the points $\{\overline{x}_t\}_{t=0}^{T-1}$. Moreover, the weights differ from one theorem to another. For more details, see Appendices E, F, and G.

Remark 2. This result correlates with [35], where the Lipschitz property of the Hessian was assumed. Indeed, if $\varepsilon=0$, then the Hessian Lipschitz constant $Q$ from [35] is also zero (see Table 1). However, we can see an improvement in terms of the power of $D$ in the final bound: $4/3$ in our paper instead of $5/3$. Moreover, this result is consistent with the analysis in [34] except for the last term, which takes into account the frequency of communications. In the case of a quadratic problem (that is, when $\varepsilon=0$) there is no dependence on $H$, since the last term vanishes, whereas [34] does not capture such an effect.

Next we introduce a convergence result for the convex case under bounded variance.

Theorem 2 (bounded variance for $\mu=0$). Suppose that Assumptions 1 and 2 hold for $\mu=0$ and $\rho=0$. Then after $T=KH$ iterations of Algorithm 1, with the choice of stepsizes as

$$ \begin{equation*} \begin{aligned} \, \gamma_t=\gamma=\begin{cases} \min\biggl\{\dfrac{1}{6L}\,,\dfrac{\|r_0\|\sqrt{M}}{\sigma\sqrt{T}}\biggr\} &\textit{if}\ H=1 \ \textit{or}\ M=1, \\ \min\biggl\{\dfrac{1}{6L}\,,\dfrac{\|r_0\|\sqrt{M}}{\sigma\sqrt{T}}\,, \biggl(\dfrac{\|r_0\|^2}{L_R \sigma^2TH}\biggr)^{1/3}\biggr\} \vphantom{\biggl\}^{\sum^{A^A}}} &\textit {otherwise}, \end{cases} \end{aligned} \end{equation*} \notag $$
we obtain
$$ \begin{equation*} \mathsf{E}[F(\widetilde{x}_T)-F(x_*)]=\mathcal{O}\biggl(\dfrac{L\|r_0\|^2}{T}+ \dfrac{\sigma\|r_0\|}{\sqrt{MT}}+ \biggl(\dfrac{\varepsilon L\sigma^2\|r_0\|^4}{TK}\biggr)^{1/3}\,\biggr). \end{equation*} \notag $$

Remark 3. This result is also consistent with the analysis in [34], except for the last term. The description of the difference is the same as in Remark 2. Nevertheless, our bound is worse than in [35] if we compare the terms related to approximate quadraticity. This is due to different assumptions on the stochastic oracle: [35] considers a bounded fourth central moment of the stochastic oracle, whereas we obtain a bounded second central moment when $\rho=0$.

Finally, we present theoretical guarantees of convergence for a novel case: the strongly convex case under unbounded variance for near-quadratic functions.

Theorem 3 (unbounded variance for $\mu > 0$). Suppose Assumptions 1 and 2 hold for $\mu > 0$. Then, after $T=KH$ iterations of Algorithm 1, where

$$ \begin{equation*} \gamma_t=\gamma=\min\biggl\{\dfrac{\mu}{3\rho L^2}\,,\dfrac{1}{6L}\,, \dfrac{\sqrt{\mu}}{\sqrt{6CH\rho L^2}}\,, \dfrac{\log(\max\{2,\mu^2\|r_0\|^2T^2M/\sigma^2\})}{2\mu T}\biggr\} \end{equation*} \notag $$
and $C=\mu+L_R$, we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[F(\widetilde{x}_t)-F(x_*)]&=\widetilde{\mathcal{O}}\biggl(\|r_0\|^2 \max\biggl\{\dfrac{\rho L^2}{\mu}\,,L, \dfrac{\sqrt{CH\rho L^2}}{\sqrt{\mu}}\biggr\} \\ &\qquad\times\exp\biggl(-\mu T\min\biggl\{\dfrac{\mu}{\rho L^2}\,, \dfrac{1}{L}\,,\dfrac{\sqrt{\mu}}{\sqrt{CH\rho L^2}}\biggr\}\biggr) \\ &\qquad+\dfrac{\sigma^2}{\mu M T}+\dfrac{\rho L^2H\sigma^2}{M\mu^3T^3}+ \dfrac{\varepsilon LH\sigma^2}{\mu^2 T^2}\biggr). \end{aligned} \end{equation*} \notag $$

Remark 4. A modification of Theorem 3 is also correct. Indeed, according to the bound above, $H/T^3$ can be bounded by $1/T^2$ in the third term, which leads to independence of $H$, with the same rate of convergence as in the existing results [30].

Remark 5. The result of Theorem 3 outperforms the existing ones [30]. In the absence of stochasticity we have an exponential decrease instead of $\mathcal{O}(1/T^2)$, and the other summands are better in terms of multiplicative factors, in particular, $L/\mu$.

6. Discussion

In our paper we introduce the notion of approximate quadraticity by using the assumption of $\varepsilon$-decomposition (see Assumption 1). It allows us to characterize the similarity of the optimized function to a quadratic one. Several questions arise naturally: is Assumption 1 consistent with reality? How does the parameter $\varepsilon$ reflect the near-quadratic behaviour of the target functions? To understand this let us express $\mathsf{E}\|\overline{x}_{t+1}-x_*\|^2$ in terms of $\mathsf{E}\|\overline{x}_{t}-x_*\|^2$ instead of solving a recurrence relation and analyzing $\mathsf{E}\|\overline{x}_T-x_*\|^2$ or $\mathsf{E}[F(\overline{x}_T)-F(x_*)]$. As can be observed from Theorem 1 (see Appendix E), we have

$$ \begin{equation} \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2\leqslant (1-\gamma\lambda)\mathsf{E}\|\overline{x}_t-x_*\|^2+ \dfrac{\gamma^2\sigma^2}{M}+2\varepsilon L(H-1)\gamma^3\sigma^2. \end{equation} \tag{3} $$
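
To make the role of the last term in (3) concrete, the following sketch simply iterates the right-hand side of (3) as an equality for two values of $\varepsilon$; all numerical values are our own illustrative assumptions, and the contraction parameter $\lambda$ in (3) is taken equal to $\mu$.

```python
def iterate_bound(eps, mu=0.1, gamma=0.05, sigma2=1.0, L=1.0, M=10, H=20, T=2000, d0=1.0):
    """Iterate d_{t+1} = (1 - gamma*mu)*d_t + gamma^2*sigma^2/M
       + 2*eps*L*(H-1)*gamma^3*sigma^2, i.e. the right-hand side of (3)."""
    d = d0
    for _ in range(T):
        d = (1 - gamma * mu) * d + gamma**2 * sigma2 / M + 2 * eps * L * (H - 1) * gamma**3 * sigma2
    return d

# eps = 0 (purely quadratic case) vs eps = 1: the extra term raises the noise floor
# from gamma*sigma^2/(mu*M) by an additional 2*eps*L*(H-1)*gamma^2*sigma^2/mu.
print(iterate_bound(eps=0.0), iterate_bound(eps=1.0))
```
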
In fact, the parameter $\varepsilon$ can vary with the current point of the iterative process of Algorithm 1, or, more precisely,
$$ \begin{equation*} \varepsilon \sim \dfrac{1}{M}\sum_{m=1}^M \varepsilon(x_t^m). \end{equation*} \notag $$
Therefore, considering $\varepsilon$ as a function on some set of points $\mathbb{S} \subset \mathbb{R}^d$, we observe that for functions with a Lipschitz Hessian the rate of decay of $\varepsilon$ is high. The graphs in Fig. 1 illustrate the rate of decay of $\varepsilon(x)$ (taken as the maximum of $\varepsilon(y)$ over all $\|y\| \leqslant \|x\|$) for some functions.

In Fig. 1 (b), by LogLoss with $\ell_2$-regularization, we mean

$$ \begin{equation} y=-\log\biggl(\dfrac{1}{1+e^{-x}}\biggr)+0.03 x^2. \end{equation} \tag{4} $$

In Fig. 1 (c) we analyze a well-known function that yields lower bound estimates for Local SGD [9]:

$$ \begin{equation} \mathcal{F}(x)=\begin{cases} \dfrac{x^2}{5} & \text{if}\ \ x < 0\vphantom{\biggl\}}; \\ x^2 & \text{if}\ \ x \geqslant 0. \end{cases} \end{equation} \tag{5} $$

Here we notice that for the functions in Fig. 1 (a) and Fig. 1 (b), $\varepsilon$ decreases rapidly (because they satisfy the Lipschitz Hessian assumption). Due to the decay of $\varepsilon$ we can increase the number of local steps $H$ or reduce the number of communication rounds $K$, which means improving the communication complexity.

However, for the function in Fig. 1 (c), $\varepsilon$ remains constant in any neighbourhood of its optimum. This can contribute to a better understanding of why $\mathcal{F}$ yields lower bounds for Local SGD.
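
A quick numerical check (our own illustration) confirms why the function in (5) does not have a Lipschitz Hessian: its second derivative jumps at the origin.

```python
# Second derivative of the function in (5): F(x) = x^2/5 for x < 0 and x^2 for x >= 0.
# Central finite differences on either side of 0 show the jump from 2/5 to 2.
def f(x):
    return x * x / 5 if x < 0 else x * x

def second_derivative(x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

print(second_derivative(-0.1), second_derivative(0.1))   # ~0.4 and ~2.0
# The jump of F'' at 0 means the Hessian is not Lipschitz in any neighbourhood of the
# minimizer x_* = 0, in line with the constant epsilon observed in Fig. 1 (c).
```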

From this example we can observe that (3) provides valuable insights into the convergence rate for various types of functions. However, abandoning the assumption of a Lipschitz Hessian means that we cannot estimate the rate of decay of $\varepsilon$. Therefore, conducting a meaningful asymptotic analysis while maintaining generality seems impossible.

Nevertheless, this remains an open question for future research. More specifically, what is required is a theory that takes into account the dependence of the parameter $\varepsilon$ on the current point $x$.

Appendix A. Notation

To begin with, let us introduce some useful shorthand notation. By the uppercase Latin letters $F$, $Q$, and $R$ we denote functions. The corresponding stochastic gradients are represented by the bold lowercase Latin letters $\boldsymbol{g}$, $\boldsymbol{q}$, and $\boldsymbol{r}$, and the plain (non-bold) lowercase Latin letters $g$, $q$, and $r$ denote the expectations of these gradients. A bar over a character (for example, $\overline{g}$) indicates that we take the average of this quantity over all devices.

We summarize this notation and other symbols used in the proofs in Table 2.

Table 2. Notation summary

Notation | Meaning
$\overline{x}_t$ | $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}x^m_t$, the average of the weights over all devices at iteration $t$
$\overline{F}_t$, $\overline{Q}_t$, $\overline{R}_t$ | $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M} F(x^m_t)$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M} Q(x^m_t)$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M} R(x^m_t)$, the average value of a function
$\boldsymbol{g}^m_t$, $\boldsymbol{q}^m_t$, $\boldsymbol{r}^m_t$ | $\nabla F(x^m_t,z^m_t)$, $\nabla Q(x^m_t,z^m_t)$, $\nabla R(x^m_t,z^m_t)$, the corresponding stochastic gradients at iteration $t$ on device $m$
$\overline{\boldsymbol{g}}_t$, $\overline{\boldsymbol{q}}_t$, $\overline{\boldsymbol{r}}_t$ | $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\boldsymbol{g}^m_t$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\boldsymbol{q}^m_t$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\boldsymbol{r}^m_t$, the average stochastic gradients at iteration $t$
$g^m_t$, $q^m_t$, $r^m_t$ | $\nabla F(x^m_t)$, $\nabla Q(x^m_t)$, $\nabla R(x^m_t)$, the expected values of the stochastic gradients at iteration $t$ on device $m$ (namely, $\mathsf{E}[\boldsymbol{g}^m_t]$, $\mathsf{E}[\boldsymbol{q}^m_t]$, $\mathsf{E}[\boldsymbol{r}^m_t]$)
$\overline{{g}}_t$, $\overline{{q}}_t$, $\overline{{r}}_t$ | $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}g^m_t$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}q^m_t$, $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M} r^m_t$, the average expected values of the gradients at iteration $t$
$V_t$ | $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}\|x^m_t-\overline{x}_t\|^2$, the mean deviation of $x^m_t$ from $\overline{x}_t$
$\|r_t\|$ | $\|\overline{x}_t-x_*\|$, the distance between the average weights and the optimum at iteration $t$
$\kappa$ | $\dfrac{L}{\mu}$, the condition number

Appendix B. Basic facts

Lemma 1 (Young’s inequality). For all $u,v\in \mathbb{R}^d$ and any positive $w$, Young’s inequality holds:

$$ \begin{equation} 2\langle u,v\rangle \leqslant w\|u\|^2+\dfrac{1}{w}\|v\|^2. \end{equation} \tag{6} $$

Lemma 2 (Jensen’s inequality). For any convex function $g\colon \mathbb{R}^d \to \mathbb{R}$, all $\{u_i\}_{i=1}^n$, $u_i\in\mathbb{R}^d$, and any $\{p_i\}_{i=1}^n$, $p_i\in\mathbb{R}_{\geqslant0}$, such that $\displaystyle\sum_{i=1}^n p_i=1$, Jensen’s inequality holds:

$$ \begin{equation} g\biggl(\,\sum_{i=1}^n p_i u_i\biggr) \leqslant \sum_{i=1}^n p_i g(u_i). \end{equation} \tag{7} $$

Appendix C. Descent lemma

Before presenting the proofs of the claims, let us first establish some technical results.

Lemma 3. For the averaged variables $\overline x_t$ obtained at iterations of Algorithm 1 for problem (1) under Assumption 1 the following holds:

$$ \begin{equation*} \begin{aligned} \, \|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2&=\|\overline{x}_t-x_*\|^2+ \gamma_{t}^2\|\nabla Q(\overline x_t)+\overline{r}_t-\nabla Q(x_*)- \nabla R(x_*)\|^2 \\ &\qquad-2\gamma_{t}\langle\overline{x}_t-x_*,\nabla Q(\overline x_t)\rangle -2\gamma_{t}\langle\overline{x}_t-x_*,\overline{r}_t\rangle. \end{aligned} \end{equation*} \notag $$

Proof. To prove this lemma it is enough to use simple algebraic transformations and the notation introduced in Table 2 for the functions $Q$ and $R$ from Assumption 1. We have
$$ \begin{equation*} \begin{aligned} \, &\|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2= \|\overline{x}_t-x_*\|^2+\gamma_{t}^2\|\overline{g}_t\|^2- 2\gamma_{t}\langle\overline{x}_t-x_*,\overline{g}_t\rangle \\ &=\|\overline{x}_t-x_*\|^2+\gamma_{t}^2\biggl\|\dfrac{1}{M} \sum_{m=1}^{M}{\nabla F(x^m_t)}\biggr\|^2-2\gamma_{t} \biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M}\nabla F(x^m_t)\biggr\rangle \\ &=\|\overline{x}_t-x_*\|^2+ \gamma_{t}^2 \biggl\|\dfrac{1}{M}\sum_{m=1}^{M}\nabla F(x^m_t)-\nabla F(x_*)\biggr\|^2 \\ &\qquad-2\gamma_{t}\biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M}\nabla F(x^m_t)\biggr\rangle \\ &=\|\overline{x}_t-x_*\|^2+\gamma_{t}^2\biggl\|\dfrac{1}{M}\sum_{m=1}^{M} \bigl[\nabla Q(x^m_t)+\nabla R(x^m_t)-\nabla Q(x_*)- \nabla R(x_*)\bigr]\biggr\|^2 \\ &\qquad-2\gamma_{t}\biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M}\nabla Q(x^m_t)\biggr\rangle-2\gamma_{t} \biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M}\nabla R(x^m_t)\biggr\rangle \\ &=\|\overline{x}_t-x_*\|^2+\gamma_{t}^2\biggl\|\dfrac{1}{M}\sum_{m=1}^{M} \bigl[q^m_t+r^m_t-\nabla Q(x_*)-\nabla R(x_*)\bigr]\biggr\|^2 \\ &\qquad-2\gamma_{t}\biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M} q^m_t\biggr\rangle-2\gamma_{t} \biggl\langle\overline{x}_t-x_*,\dfrac{1}{M}\sum_{m=1}^{M}r^m_t\biggr\rangle \\ &=\|\overline{x}_t-x_*\|^2+\gamma_{t}^2\biggl\|\dfrac{1}{M}\sum_{m=1}^{M}q^m_t+ \overline{r}_t-\nabla Q(x_*)-\nabla R(x_*)\biggr\|^2 \\ &\qquad-2\gamma_{t}\biggl\langle\overline{x}_t-x_*, \dfrac{1}{M}\sum_{m=1}^{M}{q}^m_t\biggr\rangle-2\gamma_{t} \langle\overline{x}_t-x_*,\overline{r}_t\rangle. \end{aligned} \end{equation*} \notag $$
It remains to take into account that the function $Q$ is quadratic. In particular, we use that $\dfrac{1}{M}\displaystyle\sum_{m=1}^{M}q^m_t=\nabla Q(\overline x_t)$. $\Box$

Next, we proceed to work with separate terms of the expression in Lemma 3.

Lemma 4. For iterations of Algorithm 1 for problem (1) under Assumption 1 the following inequality holds for any positive $\zeta$:

$$ \begin{equation*} \begin{aligned} \, &\|\nabla Q(\overline x_t)+\overline{r}_t-\nabla Q(x_*)-\nabla R(x_*)\|^2 \\ &\qquad\leqslant 2L_Q(1+\zeta)\bigl(Q(\overline{x}_t)-Q(x_*)- \langle\nabla Q(x_*),\overline{x}_t-x_*\rangle\bigr) \\ &\qquad\qquad+2L_R\biggl(1+\dfrac{1}{\zeta}\biggr)\bigl(\overline{R}_t-R(x_*)- \langle\nabla R(x_*),\overline{x}_t-x_*\rangle\bigr). \end{aligned} \end{equation*} \notag $$

Proof. By Young’s inequality (6), for any positive $\zeta$ we have
$$ \begin{equation*} \begin{aligned} \, &\|\nabla Q(\overline x_t)+\overline{r}_t-\nabla Q(x_*)-\nabla R(x_*)\|^2 \\ &\qquad=\|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2+ \|\overline{r}_t-\nabla R(x_*)\|^2 \\ &\qquad\qquad+2\langle\nabla Q(\overline x_t)-\nabla Q(x_*), \overline{r}_t-\nabla R(x_*)\rangle \\ &\qquad\leqslant \|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2+ \|\overline{r}_t-\nabla R(x_*)\|^2 \\ &\qquad\qquad+\zeta\|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2 +\dfrac{1}{\zeta}\|\overline{r}_t-\nabla R(x_*)\|^2 \\ &\qquad=(1+\zeta)\|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2+ \biggl(1+\dfrac{1}{\zeta}\biggr)\|\overline{r}_t-\nabla R(x_*)\|^2 \\ &\qquad=(1+\zeta)\|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2 \\ &\qquad\qquad+\biggl(1+\dfrac{1}{\zeta}\biggr)\biggl\|\dfrac{1}{M}\sum_{m=1}^M \bigl[\nabla R(x^m_t)-\nabla R(x_*)\bigr]\biggr\|^2 \\ &\qquad\leqslant (1+\zeta)\|\nabla Q(\overline x_t)-\nabla Q(x_*)\|^2 \\ &\qquad\qquad+\biggl(1+\dfrac{1}{\zeta}\biggr) \cdot \dfrac{1}{M}\sum_{m=1}^M \|\nabla R(x^m_t)-\nabla R(x_*)\|^2. \end{aligned} \end{equation*} \notag $$
Here we have also used the notation for $\overline r_t$ and the convexity of the norm (7). Using the smoothness of $Q$ and $R$ (Assumption 1 and Corollary 1) one can obtain
$$ \begin{equation*} \begin{aligned} \, &\|\nabla Q(\overline x_t)+\overline{r}_t-\nabla Q(x_*)-\nabla R(x_*)\|^2 \\ &\qquad\leqslant 2(1+\zeta)L_Q\bigl(Q(\overline x_t)-Q (x_*)- \langle\nabla Q(x_*),\overline{x}_t-x_*\rangle\bigr) \\ &\qquad\qquad+2\biggl(1+\dfrac{1}{\zeta}\biggr)L_R \cdot \dfrac{1}{M}\sum_{m=1}^M\bigl[R(x^m_t)-R(x_*)- \langle\nabla R(x_*),x^m_t-x_*\rangle\bigr]. \end{aligned} \end{equation*} \notag $$
To finish the proof one just needs to use the notation $\overline R_t$. $\Box$

Lemma 5. For iterations of Algorithm 1 for problem (1) under Assumption 1 the following holds:

$$ \begin{equation*} -2\langle\overline{x}_t-x_*,\nabla Q(\overline x_t)\rangle\leqslant 2Q(x_*)-2Q(\overline{x}_t)-\mu_Q\|\overline{x}_t-x_*\|^2. \end{equation*} \notag $$

Proof. It is enough to use the $\mu_Q$-strong convexity of $Q$. $\Box$

Lemma 6. For iterations of Algorithm 1 for problem (1) under Assumption 1 the following holds:

$$ \begin{equation*} \begin{aligned} \, -2\langle\overline{x}_t-x_*,\overline{r}_t\rangle&\leqslant -(\overline{R}_t-R (x_*))+2 L_R V_t-\mu_R\|x_*-\overline{x}_t\|^2 \\ &\qquad-\langle\nabla R(x_*),\overline{x}_t-x_*\rangle. \end{aligned} \end{equation*} \notag $$

Proof. Using the notation $r^m_t$ we obtain
$$ \begin{equation} \begin{aligned} \, 2\langle x_*-\overline{x}_t,\overline{r}_t\rangle&= \dfrac{2}{M}\sum_{m=1}^{M}\langle x_*-x^m_t+x^m_t-\overline{x}_t,r^m_t\rangle \nonumber \\ &=\dfrac{2}{M}\sum_{m=1}^{M}\langle x_*-x^m_t,r^m_t\rangle +\dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t,r^m_t\rangle. \end{aligned} \end{equation} \tag{8} $$
We estimate the first term using the $\mu_R$-strong convexity and Jensen’s inequality (7):
$$ \begin{equation} \begin{aligned} \, \dfrac{2}{M}\sum_{m=1}^{M}\langle x_*-x^m_t,r^m_t\rangle&\leqslant \dfrac{1}{M}\sum_{m=1}^{M}\bigl(2R(x_*)-2R(x^m_t)-\mu_R\|x_*-x^m_t\|^2\bigr) \nonumber \\ &\leqslant \dfrac{1}{M}\sum_{m=1}^{M}\bigl(2R(x_*)-2R(x^m_t)- \mu_R\|x_*-\overline{x}_t\|^2\bigr) \nonumber \\ &=2R(x_*)-2\overline{R}_t-\mu_R\|x_*-\overline{x}_t\|^2. \end{aligned} \end{equation} \tag{9} $$
To the second part we apply Young’s inequality (6):
$$ \begin{equation} \begin{aligned} \, \dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t,r^m_t\rangle&= \dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, r^m_t-\nabla R(x_*)\rangle \nonumber \\ &\qquad+\dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, \nabla R( x_*)\rangle \nonumber \\ &=\dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, r^m_t-\nabla R(x_*)\rangle \nonumber \\ &\leqslant \dfrac{2}{M}\sum_{m=1}^{M}L_R\|x^m_t-\overline{x}_t\|^2+ \dfrac{1}{M}\sum_{m=1}^{M}\dfrac{1}{2L_R}\|r^m_t-\nabla R(x_*)\|^2 \nonumber \\ &=2L_R V_t+\dfrac{1}{M}\sum_{m=1}^{M}\dfrac{1}{2L_R}\|r^m_t-\nabla R(x_*)\|^2. \end{aligned} \end{equation} \tag{10} $$
Here we have also used the notation $V_t$. By Assumption 1 on the smoothness of $R$ we have
$$ \begin{equation} \|r^m_t-\nabla R(x_*)\|^2 \leqslant 2L_R\bigl(R(x^m_t)-R(x_*)- \langle\nabla R(x_*),x^m_t-x_*\rangle\bigr). \end{equation} \tag{11} $$
Substituting (11) into (10) we complete the second part:
$$ \begin{equation} \begin{aligned} \, \dfrac{2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t,r^m_t\rangle&\leqslant 2L_R V_t+\dfrac{1}{M}\sum_{m=1}^{M}\bigl[R(x^m_t)-R (x_*)- \langle\nabla R(x_*),x^m_t-x_*\rangle\bigr] \nonumber \\ &\leqslant 2L_R V_t+\bigl(\overline{R}_t-R (x_*)- \langle\nabla R(x_*),\overline{x}_t-x_*\rangle\bigr). \end{aligned} \end{equation} \tag{12} $$
Combining together (8), (9), and (12) we obtain
$$ \begin{equation*} \begin{aligned} \, -2\langle\overline{x}_t-x_*,\overline{r}_t\rangle&\leqslant 2R(x_*)-2\overline{R}_t-\mu_R\|x_*-\overline{x}_t\|^2+2L_R V_t \\ &\qquad+\bigl(\overline{R}_t-R(x_*)- \langle\nabla R(x_*),\overline{x}_t-x_*\rangle\bigr) \\ &=-\bigl(\overline{R}_t-R(x_*)\bigr)+2L_R V_t-\mu_R\|x_*-\overline{x}_t\|^2 \\ &\qquad-\langle\nabla R (x_*),\overline{x}_t-x_*\rangle.\quad \Box \end{aligned} \end{equation*} \notag $$

Lemma 7. For iterations of Algorithm 1 with $\gamma_t \leqslant 1/(6L)$ for problem (1) under Assumption 1 the following holds:

$$ \begin{equation*} \|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2\leqslant (1-\gamma_t\mu)\|\overline{x}_t-x_*\|^2- \dfrac{\gamma_t}{6}\bigl(F(\overline{x}_t)-F(x_*)\bigr)+ 2\gamma_{t}L_R V_t. \end{equation*} \notag $$

Proof. Substituting the results of Lemmas 4, 5, and 6 into Lemma 3 and performing some simple algebra, we obtain
$$ \begin{equation*} \begin{aligned} \, \|\overline{x}_t-x_*-\gamma_{t} \overline{g}_t\|^2 &\leqslant \|\overline{x}_t-x_*\|^2 \\ &\qquad+2\gamma_{t}^2L_Q(1+\zeta)\bigl(Q(\overline{x}_t)- Q (x_*)-\langle\nabla Q(x_*),\overline{x}_t-x_*\rangle\bigr) \\ &\qquad+2\gamma_{t}^2 L_R\biggl(1+\dfrac{1}{\zeta}\biggr) \bigl(\overline{R}_t-R (x_*)- \langle\nabla R(x_*),\overline{x}_t-x_*\rangle\bigr) \\ &\qquad+\gamma_{t}\bigl(2Q(x_*)- 2Q(\overline{x}_t)-\mu_Q\|\overline{x}_t-x_*\|^2\bigr) \\ &\qquad-\gamma_{t}\bigl(\overline{R}_t-R(x_*)-2L_R V_t+ \mu_R\|x_*-\overline{x}_t\|^2\bigr) \\ &\qquad-\gamma_{t}\langle\nabla R(x_*),\overline{x}_t-x_*\rangle \\ &=(1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+ 2\gamma_{t} L_R V_t \\ &\qquad+2\gamma_t[\gamma_{t}L_Q(1+\zeta)-1] \bigl(Q(\overline{x}_t)-Q (x_*)\bigr) \\ &\qquad+2\gamma_{t}\biggl[\gamma_{t}L_R\biggl(1+\dfrac{1}{\zeta}\biggr)+ \dfrac{1}{2}-1\biggr](\overline{R}_t-R(x_*)) \\ &\qquad-2\gamma_t[\gamma_{t} L_Q(1+\zeta)] \langle\nabla Q(x_*),\overline{x}_t-x_*\rangle \\ &\qquad-2\gamma_t\biggl[\gamma_{t}L_R\biggl(1+\dfrac{1}{\zeta}\biggr)+ \dfrac{1}{2}\biggr]\langle\nabla R(x_*),\overline{x}_t-x_*\rangle. \end{aligned} \end{equation*} \notag $$
We want to obtain $\gamma_{t}L_R(1+1/\zeta)+1/2=\gamma_{t}L_Q(1+\zeta)$. One can note that this is equivalent to finding a positive root of the equation
$$ \begin{equation*} L_Q \zeta^2+\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)\zeta-L_R=0. \end{equation*} \notag $$
For instance, we take
$$ \begin{equation*} \zeta=\dfrac{1}{2L_Q}\biggl(-\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)+ \sqrt{\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)^2+4L_Q L_R}\,\biggr)>0. \end{equation*} \notag $$
For this $\zeta$ we have
$$ \begin{equation*} \begin{aligned} \, \|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2&\leqslant (1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+ 2\gamma_{t} L_R V_t \\ &\qquad+2\gamma_t[\gamma_{t}L_Q(1+\zeta)-1](Q(\overline{x}_t)-Q (x_*)) \\ &\qquad+2\gamma_{t}[\gamma_{t}L_Q(1+\zeta)-1](\overline{R}_t-R(x_*)) \\ &\qquad-2\gamma_t[\gamma_{t}L_Q(1+\zeta)]\langle\nabla R(x_*)+ \nabla Q(x_*),\overline{x}_t-x_*\rangle. \end{aligned} \end{equation*} \notag $$
Taking into account the optimality condition $\nabla R(x_*)+\nabla Q(x_*)=0$, we obtain
$$ \begin{equation} \begin{aligned} \, \|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2&\leqslant (1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+ 2\gamma_{t}L_R V_t \nonumber \\ &\qquad+2\gamma_t[\gamma_{t}L_Q(1+\zeta)-1](Q(\overline{x}_t)-Q(x_*)) \nonumber \\ &\qquad+ 2\gamma_{t}[\gamma_{t}L_Q(1+\zeta)-1](\overline{R}_t-R(x_*)). \end{aligned} \end{equation} \tag{13} $$
Next we prove that $\gamma_{t} L_Q(1+\zeta) \leqslant 1$ for $\gamma_t \leqslant 1/(6L)$, where $L=L_Q+L_R$:
$$ \begin{equation*} \begin{aligned} \, \gamma_{t} L_Q(1+\zeta)&=\gamma_{t} L_Q \Biggl[1+\dfrac{1}{2L_Q}\biggl(-\biggl(L_Q-L_R- \dfrac{1}{2\gamma_{t}}\biggr)\\ &\qquad\qquad\qquad+ \sqrt{\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)^2+4 L_Q L_R}\biggr)\Biggr] \\ &=\dfrac{\gamma_{t}}{2}\Biggl[2L_Q-\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)+ \sqrt{\biggl(L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr)^2+4 L_Q L_R}\,\Biggr] \\ &\leqslant\dfrac{\gamma_{t}}{2}\biggl[L_Q+L_R+\dfrac{1}{2\gamma_{t}}+ \biggl|L_Q-L_R-\dfrac{1}{2\gamma_{t}}\biggr|+\sqrt{4L_Q L_R}\,\biggr] \\ & \leqslant\dfrac{\gamma_{t}}{2}\biggl[L_Q+L_R+\dfrac{1}{2\gamma_{t}}+ L+\dfrac{1}{2\gamma_{t}}+\sqrt{4L^2}\,\biggr] \\ & \leqslant\dfrac{\gamma_{t}}{2}\biggl[5L+\dfrac{1}{\gamma_{t}}\biggr]\leqslant \dfrac{\gamma_{t}}{2}\biggl[\dfrac{5}{6\gamma_{t}}+\dfrac{1}{\gamma_{t}}\biggr] \\ &=\dfrac{5}{12}+\dfrac{1}{2} \leqslant \dfrac{11}{12}\,. \end{aligned} \end{equation*} \notag $$
This fact shows that $\gamma_{t}L_Q(1+\zeta)-1 \leqslant -1/12$, and therefore one can use Jensen’s inequality (7) to deduce the relation
$$ \begin{equation} \begin{aligned} \, [\gamma_{t}L_Q(1+\zeta)-1]\overline{R}_t&=[\gamma_{t}L_Q(1+\zeta)-1] \cdot \dfrac{1}{M}\sum_{m=1}^M R(x^m_t) \nonumber \\ &\leqslant [\gamma_{t}L_Q(1+\zeta)-1] \cdot R\biggl(\dfrac{1}{M}\sum_{m=1}^M x^m_t\biggr) \nonumber \\ &=[\gamma_{t}L_Q(1+\zeta)-1]R(\overline x_t). \end{aligned} \end{equation} \tag{14} $$
Combining (13) and (14) we obtain
$$ \begin{equation*} \begin{aligned} \, \|\overline{x}_t-x_*-\gamma_{t} \overline{g}_t\|^2&\leqslant (1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+2\gamma_{t}L_RV_t \\ &\qquad+2\gamma_t[\gamma_{t}L_Q(1+\zeta)-1](Q(\overline{x}_t)-Q (x_*)) \\ &\qquad+2\gamma_{t}[\gamma_{t}L_Q(1+\zeta)-1](R(\overline x_t)-R(x_*)) \\ &=(1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+ 2\gamma_{t} L_R V_t \\ &\qquad+2\gamma_t[\gamma_{t}L_Q(1+\zeta)-1]\bigl(Q(\overline{x}_t)+ R(\overline x_t)-Q(x_*)-R(x_*)\bigr) \\ &\leqslant(1-\gamma_{t}\mu_Q-\gamma_{t}\mu_R)\|\overline{x}_t-x_*\|^2+ 2\gamma_{t}L_R V_t \\ &\qquad-\dfrac{\gamma_t}{6}(F(\overline x_t)-F(x_*)). \end{aligned} \end{equation*} \notag $$
Using that $\mu_Q+\mu_R=\mu$ we arrive at the statement of the lemma. $\Box$

Now we are ready to combine the above results and obtain the descent lemma.

Lemma 8. For the averaged variables $\overline x_t$ obtained by iterations of Algorithm 1 with $\gamma_t \leqslant 1/(6L)$ for problem (1) under Assumption 1 the following holds:

$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2 &\leqslant (1-\gamma_{t}\mu)\mathsf{E}\|\overline{x}_t-x_*\|^2+ \gamma_{t}^2\mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2 \\ &\qquad-\dfrac{\gamma_{t}}{6}\,\mathsf{E}[F(\overline{x}_t)-F_*]+ 2\gamma_{t}L_R \mathsf{E}[V_t]. \end{aligned} \end{equation*} \notag $$

Proof. Using the update of Algorithm 1 we have
$$ \begin{equation*} \begin{aligned} \, \|\overline{x}_{t+1}-x_*\|^2&=\|\overline{x}_{t}- \gamma_{t}\overline{\boldsymbol{g}}_t-x_*\|^2= \|\overline{x}_{t}-\gamma_{t}\overline{\boldsymbol{g}}_t- x_*-\gamma_{t}\overline{g}_t+\gamma_{t}\overline{g}_t\|^2 \\ &=\|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2+ \gamma_{t}^2\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2- 2\gamma_{t}\langle\overline{x}_t-x_*-\gamma_{t}\overline{g}_t, \overline{\boldsymbol{g}}_t-\overline{g}_t\rangle. \end{aligned} \end{equation*} \notag $$
Taking the expectation gives
$$ \begin{equation*} \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2= \mathsf{E}\|\overline{x}_t-x_*-\gamma_{t}\overline{g}_t\|^2+ \gamma_{t}^2\mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2. \end{equation*} \notag $$
Using the results of Lemma 7 we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2&\leqslant (1-\gamma_{t}\mu) \mathsf{E}\|\overline{x}_t-x_*\|^2+\gamma_{t}^2\mathsf{E} \|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2 \\ &\qquad-\dfrac{\gamma_{t}}{6}\,\mathsf{E}[F(\overline{x}_t)-F_*]+ 2\gamma_{t}L_R\mathsf{E}[V_t]. \quad \Box \end{aligned} \end{equation*} \notag $$

Appendix D. Variance lemmas

In the previous section we proved the descent lemma (see Lemma 8). It remains to estimate $\mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2$ and $\mathsf{E}[V_t]$.

Lemma 9. For the stochastic gradients from Algorithm 1 for problem (1) under Assumptions 1 and 2 the following estimate holds:

$$ \begin{equation*} \mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2\leqslant \dfrac{\sigma^2}{M}+\dfrac{\rho L^2}{M}V_t+\dfrac{\rho L^2}{M} \|\overline{x}_t-x_*\|^2. \end{equation*} \notag $$

Proof. First we note that all the $\boldsymbol{g}^m_t$ are independent and unbiased; hence
$$ \begin{equation} \mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2= \mathsf{E}\biggl\|\dfrac{1}{M}\sum_{m=1}^{M}[\boldsymbol{g}^m_t-g^m_t]\biggr\|^2= \dfrac{1}{M^2}\sum_{m=1}^{M}\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2. \end{equation} \tag{15} $$
Next, we use Assumption 2 to derive that
$$ \begin{equation} \begin{aligned} \, \dfrac{1}{M}\sum_{m=1}^{M}\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2&\leqslant \sigma^2+\dfrac{\rho}{M} \sum_{m=1}^{M}\|g^m_t\|^2 \nonumber \\ &=\sigma^2+\dfrac{\rho}{M}\sum_{m=1}^{M}\|\nabla F(x^m_t)-\nabla F(x_*)\|^2 \nonumber \\ &\leqslant\sigma^2+\dfrac{\rho L^2}{M}\sum_{m=1}^{M}\|x^m_t-x_*\|^2 \nonumber \\ &=\sigma^2+\dfrac{\rho L^2}{M}\sum_{m=1}^{M}\|x^m_t- \overline{x}_t+\overline{x}_t-x_*\|^2 \nonumber \\ &=\sigma^2+\dfrac{\rho L^2}{M}\sum_{m=1}^{M}\bigl(\|x^m_t-\overline{x}_t\|^2+ \|\overline{x}_t-x_*\|^2\bigr) \nonumber \\ &\qquad+\dfrac{2\rho L^2}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, \overline{x}_t-x_*\rangle \nonumber \\ &=\sigma^2+\dfrac{\rho L^2}{M}\sum_{m=1}^{M}\bigl(\|x^m_t-\overline{x}_t\|^2+ \|\overline{x}_t-x_*\|^2\bigr) \nonumber \\ &=\sigma^2+\rho L^2 V_t+\rho L^2\|\overline{x}_t-x_*\|^2. \end{aligned} \end{equation} \tag{16} $$
Substituting (16) into (15) and applying the definition of $V_t$, we obtain the final result. $\Box$

Lemma 10. Suppose that Assumptions 1 and 2 hold.

(a) If $\rho=0$ and $\gamma_{t}=\gamma \leqslant 1/(6L)$, then

$$ \begin{equation} \mathsf{E}[V_t] \leqslant (H-1) \sigma^2 \gamma^2. \end{equation} \tag{17} $$

(b) If $\rho=0$, $\mu > 0$, and $\gamma_t=2/[\mu(\xi+t+1)]$ with $\xi \geqslant 0$, then

$$ \begin{equation} \mathsf{E}[V_t] \leqslant \dfrac{2(M-1)(H-1)\sigma^2\gamma_{t-1}^2}{M}\,. \end{equation} \tag{18} $$

(c) If $\rho=0$ and $\gamma_t$ is non-increasing, then

$$ \begin{equation} \mathsf{E}[V_t] \leqslant \dfrac{2(M-1)(H-1)\sigma^2\gamma_{(t-H+1)\vee 0}^2}{M}\,. \end{equation} \tag{19} $$

(d) If $\mu > 0$ and $\gamma_t \leqslant \min\{\mu/(3\rho L^2),1/(6L)\}$, then

$$ \begin{equation} \mathsf{E} [V_t] \leqslant \sum_{i=kH}^{kH+a-1}\bigl(\rho\gamma_{i}^2 L^2 \mathsf{E}\|r_i\|^2+\gamma_{i}^2 \sigma^2\bigr)\prod_{j=i+1}^{kH+a-1} \biggl(1-\dfrac{\gamma_{j}\mu}{2}\biggr) \end{equation} \tag{20} $$
for $t=kH+a$.

Proof. First, note that assertions (a), (b), and (c) have already been proved ((a) in [15] and (b) and (c) in [34]). Now let us prove (d).

For $t \in \mathbb{N}$ we have $x^m_{t+1}=x^m_t-\gamma_{t}\boldsymbol{g}^m_t$ and $\overline{x}_{t+1}=\overline{x}_t-\gamma_{t}\boldsymbol{\overline{g}}_t$ if $t+1\not\equiv 0 \pmod H$. Hence, for such $t$, taking the conditional expectation we obtain

$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\|x^m_{t+1}-\overline{x}_{t+1}\|^2&=\|x^m_t-\overline{x}_t\|^2+ \gamma_{t}^2\mathsf{E}\|\boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\|^2- 2\gamma_{t}\mathsf{E}\langle x^m_t-\overline{x}_t, \boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\rangle \\ &=\|x^m_t-\overline{x}_t\|^2+\gamma_{t}^2\mathsf{E}\|\boldsymbol{g}^m_t- \boldsymbol{\overline{g}}_t\|^2- 2\gamma_{t}\langle x^m_t-\overline{x}_t,g^m_t\rangle \\ &\qquad+2\gamma_{t}\langle x^m_t-\overline{x}_t,\overline{g}_t\rangle. \end{aligned} \end{equation*} \notag $$
After averaging over $m$, the last term vanishes. Therefore,
$$ \begin{equation} \mathsf{E}[V_{t+1}]=V_t+\dfrac{\gamma_{t}^2}{M}\sum_{m=1}^{M} \mathsf{E}\|\boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\|^2- \dfrac{2\gamma_t}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t,g^m_t\rangle. \end{equation} \tag{21} $$
By expanding the square in the second term of (21) we obtain
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\|\boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\|^2&= \mathsf{E}\|\boldsymbol{g}^m_t-\overline{g}_t+\overline{g}_t- \boldsymbol{\overline{g}}_t\|^2 \nonumber \\ &=\mathsf{E}\|\boldsymbol{g}^m_t-\overline{g}_t\|^2+ \mathsf{E}\|\boldsymbol{\overline{g}}_t-\overline{g}_t\|^2+ 2\mathsf{E}\langle\boldsymbol{g}^m_t-\overline{g}_t,\overline{g}_t- \boldsymbol{\overline{g}}_t\rangle. \end{aligned} \end{equation} \tag{22} $$
Now again,
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\|\boldsymbol{g}^m_t-\overline{g}_t\|^2&= \mathsf{E}\|\boldsymbol{g}^m_t-g^m_t+g^m_t-\overline{g}_t\|^2 \nonumber \\ &=\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+\|g^m_t-\overline{g}_t\|^2+ 2\mathsf{E}\langle\boldsymbol{g}^m_t-g^m_t,g^m_t-\overline{g}_t\rangle \nonumber \\ &=\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+\|g^m_t-\overline{g}_t\|^2+ 2\langle g^m_t-g^m_t,g^m_t-\overline{g}_t\rangle \nonumber \\ &=\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+\|g^m_t-\overline{g}_t\|^2. \end{aligned} \end{equation} \tag{23} $$
Combining (22) and (23) we have
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\|\boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\|^2&= \mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+\|g^m_t-\overline{g}_t\|^2+ \mathsf{E}\|\boldsymbol{\overline{g}}_t-\overline{g}_t\|^2 \\ &\qquad+2\mathsf{E}[\langle\boldsymbol{g}^m_t-\overline{g}_t, \overline{g}_t-\boldsymbol{\overline{g}}_t\rangle]. \end{aligned} \end{equation*} \notag $$
By averaging both sides over $m$ we obtain
$$ \begin{equation} \begin{aligned} \, \dfrac{1}{M}\sum_{m=1}^{M}\mathsf{E}\|\boldsymbol{g}^m_t- \boldsymbol{\overline{g}}_t\|^2&=\dfrac{1}{M}\sum_{m=1}^{M} \mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+ \dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\overline{g}_t\|^2 \nonumber \\ &\qquad+\mathsf{E}\|\boldsymbol{\overline{g}}_t-\overline{g}_t\|^2+ 2\mathsf{E}\langle\boldsymbol{\overline{g}}_t-\overline{g}_t, \overline{g}_t-\boldsymbol{\overline{g}}_t\rangle \nonumber \\ &=\dfrac{1}{M}\sum_{m=1}^{M} \mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+ \dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\overline{g}_t\|^2 \nonumber \\ &\qquad+\mathsf{E}\|\boldsymbol{\overline{g}}_t-\overline{g}_t\|^2- 2\mathsf{E}\|\boldsymbol{\overline{g}}_t-\overline{g}_t\|^2 \nonumber \\ &\leqslant \dfrac{1}{M}\sum_{m=1}^{M}\mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+ \dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\overline{g}_t\|^2. \end{aligned} \end{equation} \tag{24} $$
Using Assumption 1 we bound the second term here as follows:
$$ \begin{equation} \begin{aligned} \, \dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\overline{g}_t\|^2&=\dfrac{1}{M}\sum_{m=1}^{M} \|g^m_t-\nabla F(\overline{x}_t)+\nabla F(\overline{x}_t)-\overline{g}_t\|^2 \nonumber \\ &=\dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\nabla F(\overline{x}_t)\|^2+ \|\nabla F(\overline{x}_t)-\overline{g}_t\|^2 \nonumber \\ &\qquad+\dfrac{2}{M}\sum_{m=1}^M\langle g^m_t-\nabla F(\overline{x}_t), \nabla F(\overline{x}_t)-\overline{g}_t\rangle \nonumber \\ &=\dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\nabla F(\overline{x}_t)\|^2+ \|\nabla F(\overline{x}_t)-\overline{g}_t\|^2 \nonumber \\ &\qquad-2\|\nabla F(\overline{x}_t)-\overline{g}_t\|^2 \nonumber \\ &=\dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\nabla F(\overline{x}_t)\|^2- \|\nabla F(\overline{x}_t)-\overline{g}_t\|^2 \nonumber \\ &\leqslant\dfrac{1}{M}\sum_{m=1}^{M}\|g^m_t-\nabla F(\overline{x}_t)\|^2 \nonumber \\ &=\dfrac{1}{M}\sum_{m=1}^{M}\|\nabla F(x^m_t)-\nabla F(\overline{x}_t)\|^2 \nonumber \\ &\leqslant\dfrac{1}{M}\sum_{m=1}^{M} 2L\bigl(F(\overline{x}_t)-F(x^m_t)- \langle\overline{x}_t-x^m_t,\nabla F(x^m_t)\rangle\bigr) \nonumber \\ &\overset{(7)}{\leqslant}\dfrac{2L}{M}\sum_{m=1}^{M} \langle x^m_t-\overline{x}_t,\nabla F(x^m_t)\rangle. \end{aligned} \end{equation} \tag{25} $$
Substituting (25) into (24) and applying the estimate (16) from the proof of Lemma 9, one can obtain
$$ \begin{equation*} \begin{aligned} \, \dfrac{1}{M}\sum_{m=1}^{M}\mathsf{E}\|\boldsymbol{g}^m_t- \boldsymbol{\overline{g}}_t\|^2&\leqslant\dfrac{1}{M}\sum_{m=1}^{M} \mathsf{E}\|\boldsymbol{g}^m_t-g^m_t\|^2+\dfrac{2L}{M} \sum_{m=1}^{M} \langle x^m_t-\overline{x}_t,\nabla F(x^m_t)\rangle \\ &\leqslant\sigma^2+\rho L^2 V_t+\rho L^2\|r_t\|^2 \\ &\qquad+\dfrac{2L}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, \nabla F(x^m_t)\rangle. \end{aligned} \end{equation*} \notag $$
Consequently, combining the above result with (21) we have
$$ \begin{equation} \begin{aligned} \, \mathsf{E}[V_{t+1}]&=V_t+\dfrac{\gamma_{t}^2}{M}\sum_{m=1}^{M} \mathsf{E}\|\boldsymbol{g}^m_t-\boldsymbol{\overline{g}}_t\|^2- \dfrac{2\gamma_{t}}{M}\sum_{m=1}^{M}\langle x^m_t-\overline{x}_t, \nabla F(x^m_t)\rangle \nonumber \\ &\leqslant V_t+\gamma_{t}^2\sigma^2+\gamma_{t}^2\rho L^2 V_t+ \gamma_{t}^2 \rho L^2 \|r_t\|^2 \nonumber \\ &\qquad-\dfrac{2\gamma_{t}}{M}(1-\gamma_{t} L)\sum_{m=1}^{M} \langle x^m_t-\overline{x}_t,\nabla F(x^m_t)\rangle. \end{aligned} \end{equation} \tag{26} $$
Let us analyze the last term. We know that $\gamma_{t} \leqslant 1/(6L)$, and therefore $1-\gamma_{t}L \geqslant 0$. Thus, by $\mu$-strong convexity and Jensen’s inequality (7),
$$ \begin{equation} \begin{aligned} \, &-\dfrac{2\gamma_{t}}{M}(1-\gamma_{t} L) \sum_{m=1}^{M} \langle x^m_t-\overline{x}_t,\nabla F(x^m_t)\rangle=\dfrac{2\gamma_{t}}{M} (1-\gamma_{t}L)\sum_{m=1}^{M}\langle\overline{x}_t-x^m_t, \nabla F(x^m_t)\rangle \nonumber \\ &\qquad\leqslant \dfrac{2\gamma_{t}}{M}(1-\gamma_{t} L)\sum_{m=1}^{M} \biggl(F(\overline{x}_t)-F(x^m_t)- \dfrac{\mu}{2}\|x^m_t-\overline{x}_t\|^2\biggr) \nonumber \\ &\qquad\overset{(7)}{\leqslant}-\dfrac{\gamma_{t}}{M}(1-\gamma_{t} L) \sum_{m=1}^{M}\mu\|x^m_t-\overline{x}_t\|^2= -\gamma_{t}(1-\gamma_{t} L)\mu V_t. \end{aligned} \end{equation} \tag{27} $$
Plugging (27) into (26) and using the bound $\gamma_{t} \leqslant 1/(6L)$ we obtain
$$ \begin{equation} \begin{aligned} \, \mathsf{E} [V_{t+1}]&\leqslant V_t+\gamma_{t}^2 \sigma^2+ \gamma_{t}^2 \rho L^2 V_t+\gamma_{t}^2 \rho L^2\|r_t\|^2- \gamma_{t} (1-\gamma_{t} L) \mu V_t \nonumber \\ &=[1-\gamma_{t}(1-\gamma_{t} L)\mu]V_t+\gamma_{t}^2 \sigma^2+ \gamma_{t}^2 \rho L^2 V_t+\gamma_{t}^2 \rho L^2\|r_t\|^2 \nonumber \\ &\leqslant \biggl(1-\dfrac{5\gamma_{t}\mu}{6}+\gamma_{t}^2 \rho L^2\biggr) V_t+\gamma_{t}^2\sigma^2+\gamma_{t}^2\rho L^2\|r_t\|^2, \end{aligned} \end{equation} \tag{28} $$
since $1-\gamma_t L \geqslant 5/6$. Moreover, we have $\gamma_{t} \leqslant \mu/(3\rho L^2)$, which means that
$$ \begin{equation*} -\dfrac{5\gamma_{t}\mu}{6}+\gamma_{t}^2\rho L^2 \leqslant -\dfrac{\gamma_{t}\mu}{2}\,. \end{equation*} \notag $$
Substituting this into (28) we obtain
$$ \begin{equation} \mathsf{E}[V_{t+1}]\leqslant\biggl(1-\dfrac{\gamma_{t}\mu}{2}\biggr)V_t+ \gamma_{t}^2\sigma^2+\gamma_{t}^2\rho L^2 \|r_t\|^2. \end{equation} \tag{29} $$
Writing $t=kH+a$ ($k,a \in \mathbb{N}$, $a < H$), recalling that $V_{kH}=0$, unrolling the recursion (29), taking the full expectation, and taking into account that a product with no factors is equal to $1$, we conclude that
$$ \begin{equation*} \begin{aligned} \, \mathsf{E} [V_{t+1}]&\leqslant \sum_{i=kH}^{kH+a} \bigl(\rho\gamma_{i}^2 L^2\mathsf{E}\|r_i\|^2+\gamma_{i}^2\sigma^2\bigr) \prod_{j=i+1}^{kH+a}\biggl(1-\dfrac{\gamma_{j}\mu}{2}\biggr) \\ &\qquad+\prod_{i=kH}^{kH+a}\biggl(1-\dfrac{\gamma_{i}\mu}{2}\biggr)\cdot V_{kH} \\ &=\sum_{i=kH}^{kH+a}\bigl(\rho\gamma_{i}^2 L^2\mathsf{E}\|r_i\|^2+ \gamma_{i}^2 \sigma^2\bigr)\prod_{j=i+1}^{kH+a} \biggl(1-\dfrac{\gamma_{j}\mu}{2}\biggr). \quad \Box \end{aligned} \end{equation*} \notag $$

Appendix E. Proof of Theorem 1

We tune the stepsizes $\gamma_t$ as in Theorem 2 from [34]. The proof is divided into two parts, with two different techniques for choosing $\gamma_t$.

Case (a): $T \leqslant 2\kappa$. In this case we choose the constant stepsizes $\gamma_t=\gamma=1/(6L)$ and the sequence of weights $w_t:=(1-\mu \gamma)^{-t-1}$. Substituting the results of Lemmas 9 and 10 into Lemma 8 we obtain

$$ \begin{equation} \begin{aligned} \, \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2&\leqslant(1-\gamma \mu) \mathsf{E} \|\overline{x}_t-x_*\|^2+\dfrac{\gamma^2\sigma^2}{M}- \dfrac{\gamma}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] \nonumber \\ &\qquad+2L_R(H-1)\gamma^3 \sigma^2. \end{aligned} \end{equation} \tag{30} $$
Rearranging yields
$$ \begin{equation} \begin{aligned} \, \dfrac{\gamma}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant (1-\gamma\mu)\mathsf{E}\|\overline{x}_t-x_*\|^2- \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2+\dfrac{\gamma^2\sigma^2}{M} \nonumber \\ &\qquad+2L_R(H-1)\gamma^3\sigma^2. \end{aligned} \end{equation} \tag{31} $$
Dividing both sides by $\gamma$ gives the inequality
$$ \begin{equation*} \begin{aligned} \, \dfrac{1}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant \dfrac{1}{\gamma}(1-\gamma \mu)\mathsf{E}\|\overline{x}_t-x_*\|^2- \dfrac{1}{\gamma}\,\mathsf{E}\|\overline{x}_{t+1}-x_*\|^2+ \dfrac{\gamma\sigma^2}{M} \\ &\qquad+2L_R(H-1)\gamma^2\sigma^2. \end{aligned} \end{equation*} \notag $$
Multiplying both sides by $w_t$ and summing over $t$ from $0$ to $T-1$ we derive that
$$ \begin{equation*} \begin{aligned} \, \dfrac{1}{6}\sum_{t=0}^{T-1}w_t\mathsf{E}[F(\overline{x}_t)-F(x_*)]&\leqslant \sum_{t=0}^{T-1} w_t\biggl(\dfrac{1}{\gamma}(1-\gamma \mu) \mathsf{E}\|\overline{x}_t-x_*\|^2-\dfrac{1}{\gamma}\,\mathsf{E} \|\overline{x}_{t+1}-x_*\|^2 \\ &\qquad+\dfrac{\gamma \sigma^2}{M}+2 L_R (H-1)\gamma^2\sigma^2\biggr) \\ &=\sum_{t=0}^{T-1}\biggl(\dfrac{1}{\gamma}(1-\gamma \mu)^{-t} \mathsf{E} \|\overline{x}_{t}-x_*\|^2 \\ &\qquad-\dfrac{1}{\gamma}(1-\gamma \mu)^{-(t+1)} \mathsf{E} \|\overline{x}_{t+1}-x_*\|^2 \\ &\qquad+w_t 2L_R(H-1)\gamma^2\sigma^2+\dfrac{w_t\gamma\sigma^2}{M}\biggr) \\ &\leqslant\dfrac{1}{\gamma}\,\mathsf{E}\|x_0-x_*\|^2+ \biggl(\dfrac{\gamma\sigma^2}{M}+2L_R (H-1)\gamma^2\sigma^2\biggr) \sum_{t=0}^{T-1} w_t. \end{aligned} \end{equation*} \notag $$
Set $W_T:=\displaystyle\sum_{t=0}^{T-1} w_t$. Then multiplying both sides by $1/W_T$ one obtains
$$ \begin{equation*} \dfrac{1}{6 W_T}\sum_{t=0}^{T-1}w_t\mathsf{E}[F(\overline{x}_t)-F(x_*)]\leqslant \dfrac{1}{\gamma W_T}\,\mathsf{E}\|x_0-x_*\|^2+\dfrac{\gamma\sigma^2}{M}+ 2L_R(H-1)\gamma^2 \sigma^2. \end{equation*} \notag $$

Applying Jensen’s inequality (7) to $\widetilde{x}_T:=\dfrac{1}{W_T}\displaystyle\sum_{t=0}^{T-1}w_t\overline{x}_t$ and using the bound $W_T \geqslant (1-\mu\gamma)^{-T} \geqslant \exp{(\mu\gamma T)}$, we have

$$ \begin{equation*} \dfrac{1}{6}\,\mathsf{E}[F(\widetilde{x}_T)-F(x_*)]\leqslant \dfrac{1}{\gamma}\exp{(-\mu \gamma T)}\|x_0-x_*\|^2+\dfrac{\gamma\sigma^2}{M}+ 2L_R(H-1)\gamma^2\sigma^2. \end{equation*} \notag $$
Recalling Assumption 1 and the relations $\gamma=1/(6L)$ and $2L \geqslant \mu T$, we obtain
$$ \begin{equation*} \begin{aligned} \, \mathsf{E} [F(\widetilde{x}_T)-F(x_*)]&\leqslant 36 L\exp{\biggl(-\dfrac{\mu T}{6L}\biggr)}\|x_0-x_*\|^2+ \dfrac{2\sigma^2}{\mu T M}+\dfrac{12L_R (H-1)\sigma^2}{9\mu^2 T^2} \\ &=\mathcal{O}\biggl(L\exp{\biggl(-\dfrac{\mu T}{6 L}\biggr)} \|x_0-x_*\|^2+\dfrac{\sigma^2}{\mu T M}+ \dfrac{\varepsilon L \sigma^2}{\mu^2 TK}\biggr), \end{aligned} \end{equation*} \notag $$
which finishes the proof.
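
For illustration, the output point of case (a) is just a weighted average of the iterates $\overline{x}_0,\dots,\overline{x}_{T-1}$ with the weights $w_t=(1-\mu\gamma)^{-(t+1)}$. The minimal sketch below assembles such a point from placeholder iterates; the array names and all constants are hypothetical and only show the shape of the averaging.

    # Weighted averaging used in case (a): x_tilde = (1/W_T) sum_t w_t * x_bar_t
    # with w_t = (1 - mu*gamma)^{-(t+1)}. The iterates are random placeholders.
    import numpy as np

    rng = np.random.default_rng(1)
    T, d = 100, 5
    L, mu = 2.0, 0.5
    gamma = 1.0 / (6 * L)                     # constant stepsize of case (a)
    x_bar = rng.normal(size=(T, d))           # placeholders for x_bar_0, ..., x_bar_{T-1}

    w = (1 - mu * gamma) ** (-(np.arange(T) + 1.0))   # w_t = (1 - mu*gamma)^{-(t+1)}
    W = w.sum()
    x_tilde = (w[:, None] * x_bar).sum(axis=0) / W    # the averaged output point

Later iterates receive exponentially larger weights, which is what produces the $\exp(-\mu\gamma T)$ term in the bound above.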

Case (b): $T > 2\kappa$. Without loss of generality we assume that $T$ is even to simplify the notation. Let us choose the stepsize in the following way:

$$ \begin{equation*} \begin{aligned} \, \gamma_t=\begin{cases} \dfrac{1}{6L} & \text{if}\ \ t \leqslant \dfrac{T}{2}\,, \vphantom{\biggl\}} \\ \dfrac{2}{\mu(\xi+t)} & \text{if}\ \ t > \dfrac{T}{2}\,, \end{cases} \end{aligned} \end{equation*} \notag $$
where $\xi=12 L/\mu-T/2$. Note that $\gamma_t \leqslant 1/(6L)$ for all $t$. Therefore, we can apply Lemma 10.
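
For illustration, the stepsize schedule of case (b) can be written as a short routine; the constants below are hypothetical, and only the shape of the schedule (constant for the first half of the run, then decaying) follows the proof.

    # Stepsize schedule of case (b): 1/(6L) for t <= T/2, then 2/(mu*(xi + t))
    # with xi = 12L/mu - T/2. The values of L, mu, and T are placeholders.
    L, mu = 2.0, 0.5
    T = 200                        # assumed even, as in the proof
    xi = 12 * L / mu - T / 2

    def gamma(t: int) -> float:
        if t <= T // 2:
            return 1.0 / (6 * L)
        return 2.0 / (mu * (xi + t))

    # Every value of the schedule stays below 1/(6L), so Lemma 10 applies.
    assert all(gamma(t) <= 1.0 / (6 * L) for t in range(T))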

Since $F(x)-F^* \geqslant 0$ for every $x$, we start as follows:

$$ \begin{equation*} \mathsf{E}\|x_{T/2}-x_*\|^2\leqslant (1-\gamma\mu)\mathsf{E} \|x_{T/2-1}-x_*\|^2+\dfrac{\gamma^2\sigma^2}{M}+2L_R(H-1)\gamma^3\sigma^2. \end{equation*} \notag $$
Unrolling the recursion in the above relation, we derive that
$$ \begin{equation} \begin{aligned} \, \mathsf{E}\|x_{T/2}-x_*\|^2&\leqslant (1-\gamma \mu)^{T/2}\|x_{0}-x_*\|^2+ \dfrac{\gamma\sigma^2}{\mu M}+\dfrac{2L_R(H-1)\gamma^2\sigma^2}{\mu} \nonumber \\ &=\biggl(1-\dfrac{\mu}{6L}\biggr)^{T/2}\|x_{0}-x_*\|^2+ \dfrac{\sigma^2}{6L\mu M}+\dfrac{L_R (H-1)\sigma^2}{18\mu L^2} \nonumber \\ &\leqslant \biggl(1-\dfrac{\mu}{6L}\biggr)^{T/2}\|x_{0}-x_*\|^2+ \dfrac{\sigma^2}{6L\mu M}+\dfrac{\varepsilon(H-1)\sigma^2}{18\mu L} \nonumber \\ &\leqslant\exp{\biggl(\dfrac{-\mu T}{12L}\biggr)}\|x_{0}-x_*\|^2+ \dfrac{\sigma^2}{6L\mu M}+\dfrac{\varepsilon(H-1)\sigma^2}{18\mu L}\,, \end{aligned} \end{equation} \tag{32} $$
where we have used that $\displaystyle\sum_{i=0}^t(1-\gamma \mu)^i \leqslant \dfrac{1}{\gamma\mu}$ and $L_R=\varepsilon L$. For the second part we apply a technique similar to case (a), but with $w_t=\xi+t$ and $W_T=\displaystyle\sum_{t=T/2}^{T-1}w_t$:
$$ \begin{equation*} \begin{aligned} \, &\dfrac{1}{6W_T}\sum_{t=T/2}^{T-1}w_t\mathsf{E}[F(\overline{x}_t)-F(x_*)] \\ &\qquad\leqslant \dfrac{1}{W_T}\sum_{t=T/2}^{T-1}w_t \biggl(\biggl(\dfrac{1}{\gamma_t}-\mu\biggr)\mathsf{E}\|x_t-x_*\|^2- \dfrac{1}{\gamma_t}\,\mathsf{E}\|x_{t+1}-x_*\|^2 \\ &\qquad\qquad+\dfrac{\gamma\sigma^2}{M}+2L_R(H-1)\gamma^2\sigma^2\biggr) \\ &\qquad=\dfrac{1}{W_T}\sum_{t=T/2}^{T-1}\biggl(\mu\biggl(\dfrac{\xi}{2}+ \dfrac{t}{2}\biggr)(t+\xi-2)\mathsf{E}\|x_{t}-x_*\|^2-\dfrac{\mu(t+\xi)^2}{2}\, \mathsf{E}\|x_{t+1}-x_*\|^2 \\ &\qquad\qquad+\dfrac{2\sigma^2}{\mu M}+ \dfrac{8L_R(H-1)\sigma^2}{\mu^2(\xi+t)}\biggr) \\ &\qquad\leqslant \dfrac{1}{W_T}\sum_{t=T/2}^{T-1} \biggl(\dfrac{\mu(t+\xi-1)^2}{2}\,\mathsf{E}\|x_{t}-x_*\|^2- \dfrac{\mu(t+\xi)^2}{2}\,\mathsf{E}\|x_{t+1}-x_*\|^2 \\ &\qquad\qquad+\dfrac{2\sigma^2}{\mu M}+ \dfrac{8L_R(H-1)\sigma^2}{\mu^2(\xi+t)}\biggr) \\ &\qquad\leqslant \dfrac{\mu(T/2+\xi-1)^2}{W_T}\,\mathsf{E} \|x_{T/2}-x_*\|^2+\dfrac{T\sigma^2}{W_T\mu M}+ \dfrac{8L_R(H-1)\sigma^2}{W_T\mu^2}\sum_{t=T/2}^{T-1} \dfrac{1}{\xi+t}\,. \end{aligned} \end{equation*} \notag $$
Recalling the definition of $\xi$ we obtain
$$ \begin{equation*} \begin{aligned} \, \sum_{t=T/2}^{T-1}\dfrac{1}{\xi+t}&=\sum_{t=0}^{T/2-1}\dfrac{1}{(t+12L/\mu)} \leqslant\sum_{t=0}^{T/2}\dfrac{1}{t+L/\mu} \leqslant \int_{1}^{T/2+L/\mu} \dfrac{dx}{x}+\dfrac{\mu}{L} \\ &= \dfrac{\mu}{L}+\log\biggl(\dfrac{T}{2}+\dfrac{L}{\mu}\biggr) \leqslant \dfrac{\mu}{L}+\log T, \end{aligned} \end{equation*} \notag $$
where we have used that $T \geqslant 2\kappa$. Therefore,
$$ \begin{equation*} \begin{aligned} \, \dfrac{1}{6 W_T}\sum_{t=T/2}^{T-1}w_t\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant\dfrac{\mu(T/2+\xi-1)^2}{W_T}\,\mathsf{E}\|x_{T/2}-x_*\|^2 \\ &\qquad+\dfrac{T\sigma^2}{W_T\mu M}+\dfrac{8L_R (H-1)\sigma^2}{W_T\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr) \\ &=\dfrac{\mu(12L/\mu-1)^2}{W_T}\,\mathsf{E}\|x_{T/2}-x_*\|^2+ \dfrac{T \sigma^2}{W_T\mu M} \\ &\qquad+\dfrac{8L_R(H-1)\sigma^2}{W_T\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr). \end{aligned} \end{equation*} \notag $$
Observe that
$$ \begin{equation} \begin{aligned} \, W_T=\sum_{t=T/2}^{T-1}(\xi+t)=\sum_{t=0}^{T/2-1} \biggl(\dfrac{12L}{\mu}+t\biggr) \geqslant \dfrac{T^2}{16}\,, \end{aligned} \end{equation} \tag{33} $$
since $\displaystyle\sum_{t=0}^{T/2-1}t=\dfrac{T^2}{8}-\dfrac{T}{4} \geqslant \dfrac{T^2}{16}$ once we assume, without loss of generality, that $T \geqslant 4$. Thus, substituting this bound into the above inequality we obtain
$$ \begin{equation} \begin{aligned} \, \dfrac{1}{6 W_T}\sum_{t=T/2}^{T-1}w_t\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant\dfrac{\mu(12L/\mu-1)^2}{W_T}\mathsf{E}\|x_{T/2}-x_*\|^2+ \dfrac{16\sigma^2}{\mu T M} \nonumber \\ &\qquad+\dfrac{128L_R(H-1)\sigma^2}{T^2\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr). \end{aligned} \end{equation} \tag{34} $$
Applying Jensen’s inequality (7) to $\widetilde{x}_T=\dfrac{1}{W_T}\displaystyle\sum_{t=T/2}^{T-1}w_t\overline{x}_t$ gives the inequality
$$ \begin{equation} \begin{aligned} \, F(\widetilde{x}_T) \leqslant \dfrac{1}{W_T}\sum_{t=T/2}^{T-1}w_t F(\overline{x}_t). \end{aligned} \end{equation} \tag{35} $$
Using that $T \geqslant 2L/\mu$ and combining (32), (33), (34), and (35) we have
$$ \begin{equation*} \begin{aligned} \, &\mathsf{E}[F(\widetilde{x}_t)-F(x_*)]\\ &\qquad\leqslant \dfrac{6\mu (12L/\mu-1)^2}{W_T}\biggl(\exp{\biggl(\dfrac{-\mu T}{12L}\biggr)} \|x_{0}-x_*\|^2+\dfrac{\sigma^2}{6L\mu M}+ \dfrac{\varepsilon(H-1)\sigma^2}{18\mu L}\biggr) \\ &\qquad\qquad\qquad+\dfrac{96\sigma^2}{\mu T M}+\dfrac{768L_R(H-1)\sigma^2}{T^2\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr) \\ &\qquad\leqslant \dfrac{6\mu(12L/\mu)^2}{L^2/(4\mu^2)} \exp{\biggl(\dfrac{-\mu T}{12L}\biggr)}\|x_{0}-x_*\|^2+ \dfrac{(12L/\mu)^2\sigma^2}{W_T L M} \\ &\qquad\qquad\qquad+\dfrac{(12L/\mu)^2\varepsilon H\sigma^2}{3 W_T L}+ \dfrac{96\sigma^2}{\mu TM}+\dfrac{768L_R(H-1)\sigma^2}{T^2\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr) \\ &\qquad\leqslant 3456\mu\exp{\biggl(\dfrac{-\mu T}{12L}\biggr)}\|x_{0}-x_*\|^2+ \dfrac{12^2 \cdot 8 \sigma^2}{\mu TM}+ \dfrac{16^2 \cdot 3 \varepsilon L H \sigma^2}{\mu^2 T^2} \\ &\qquad\qquad\qquad+\dfrac{96\sigma^2}{\mu T M}+\dfrac{768\varepsilon LH\sigma^2}{T^2\mu^2} \biggl(\dfrac{\mu}{L}+\log T\biggr) \\ &\qquad=\widetilde{\mathcal{O}}\biggl(L\exp{\biggl(\dfrac{-\mu T}{L}\biggr)} \|x_{0}-x_*\|^2+\dfrac{\sigma^2}{\mu TM}+ \dfrac{\varepsilon L\sigma^2}{\mu^2 TK}\biggr), \end{aligned} \end{equation*} \notag $$
which completes the proof of Theorem 1. $\Box$
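
To make the iteration analyzed above concrete, the following toy sketch runs a Local SGD loop with $M$ workers, $H$ local steps, and averaging at the communication moments $t=kH$, applied to an artificial quadratic with additive Gaussian gradient noise. The objective, the noise model, and every constant here are illustrative assumptions and do not reproduce the exact setting of Theorem 1.

    # Toy Local SGD loop of the kind analyzed in the appendices: M workers make
    # H local stochastic-gradient steps and are averaged at t = kH. The quadratic
    # objective and the Gaussian gradient noise are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(2)
    M, H, K, d = 4, 8, 50, 5
    L, mu, sigma = 2.0, 0.5, 0.1
    A = np.diag(np.linspace(mu, L, d))    # F(x) = 0.5 x^T A x, minimizer x_* = 0
    gamma = 1.0 / (6 * L)

    x = np.zeros((M, d)) + rng.normal(size=d)    # all workers start from the same x_0
    for k in range(K):                           # K communication rounds, T = KH
        for _ in range(H):                       # local steps
            grads = x @ A + sigma * rng.normal(size=(M, d))  # stochastic gradients
            x = x - gamma * grads
        x[:] = x.mean(axis=0)                    # averaging: every worker gets x_bar_{kH}

    print(float(np.linalg.norm(x.mean(axis=0))))  # distance from x_bar_T to x_* = 0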

Appendix F. Proof of Theorem 2

Considering constant stepsizes $\gamma_t=\gamma \leqslant 1/(6L)$, $\mu=0$, and $\rho=0$ and applying the results of Lemmas 9 and 10 to Lemma 8, we obtain

$$ \begin{equation*} \begin{aligned} \, \mathsf{E}\|\overline{x}_{t+1}-x_*\|^2 &\leqslant \mathsf{E}\|\overline{x}_t-x_*\|^2+\dfrac{\gamma^2\sigma^2}{M} \\ &\qquad-\dfrac{\gamma}{6}\mathsf{E}[F(\overline{x}_t)-F(x_*)]+ 2\gamma L_R(H-1)\gamma^2\sigma^2. \end{aligned} \end{equation*} \notag $$
Set $r_t:=\overline{x}_t-x_*$. Then rearranging the above relation yields
$$ \begin{equation*} \begin{aligned} \, \dfrac{\gamma}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)]\leqslant \mathsf{E}\|r_t\|^2-\mathsf{E}\|r_{t+1}\|^2+\dfrac{\gamma^2\sigma^2}{M}+ 2L_R(H-1)\gamma^3\sigma^2. \end{aligned} \end{equation*} \notag $$
Averaging over $t$ gives
$$ \begin{equation} \begin{aligned} \, \dfrac{\gamma}{6 T}\sum_{t=0}^{T-1}\mathsf{E}[F(\overline{x}_t)- F(x_*)]&\leqslant \dfrac{1}{T}\sum_{t=0}^{T-1} \bigl(\mathsf{E}\|r_t\|^2-\mathsf{E}\|r_{t+1}\|^2\bigr)+ \dfrac{\gamma^2\sigma^2}{M} \nonumber \\ &\qquad+2L_R(H-1)\gamma^3\sigma^2 \nonumber \\ &=\dfrac{\|r_0\|^2-\mathsf{E}\|r_T\|^2}{T}+\dfrac{\gamma^2\sigma^2}{M}+ 2L_R(H-1)\gamma^3\sigma^2 \nonumber \\ &\leqslant\dfrac{\|r_0\|^2}{T}+\dfrac{\gamma^2\sigma^2}{M}+ 2L_R(H-1)\gamma^3\sigma^2. \end{aligned} \end{equation} \tag{36} $$
To $\widehat{x}_T=\dfrac{1}{T}\displaystyle\sum_{t=0}^{T-1}\overline{x}_t$ we apply Jensen’s inequality (7):
$$ \begin{equation} \begin{aligned} \, \mathsf{E}[F(\widehat{x}_T)-F(x_*)] \leqslant \dfrac{1}{T}\sum_{t=0}^{T-1}\mathsf{E}[F(\overline{x}_t)-F(x_*)]. \end{aligned} \end{equation} \tag{37} $$
Plugging (37) into (36) we obtain
$$ \begin{equation*} \dfrac{\gamma}{6}\,\mathsf{E}[F(\widehat{x}_T)-F(x_*)]\leqslant \dfrac{\|r_0\|^2}{T}+\dfrac{\gamma^2\sigma^2}{M}+2L_R(H-1)\gamma^3\sigma^2. \end{equation*} \notag $$
Dividing both sides by $\gamma/6$ we have
$$ \begin{equation*} \mathsf{E}[F(\widehat{x}_T)-F(x_*)]\leqslant \dfrac{6}{\gamma T}\|r_0\|^2+ \dfrac{6\gamma\sigma^2}{M}+12L_R(H-1)\gamma^2\sigma^2. \end{equation*} \notag $$
Now choosing $\gamma$ by the formula
$$ \begin{equation*} \gamma=\begin{cases} \min\biggl\{\dfrac{1}{6L}\,,\dfrac{\|r_0\|\sqrt{M}}{\sigma\sqrt{T}}\biggr\}_{\vphantom \sum} &\text{if}\ H=1\ \text{or}\ M=1, \\ \min\biggl\{\dfrac{1}{6L}\,,\dfrac{\|r_0\|\sqrt{M}}{\sigma\sqrt{T}}\,, \biggl(\dfrac{\|r_0\|^2}{L_R\sigma^2TH}\biggr)^{1/3}\biggr\} &\text{otherwise}, \end{cases} \end{equation*} \notag $$
we conclude that
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[F(\widehat{x}_T)-F(x_*)] &\leqslant \max\biggl\{\dfrac{36L\|r_0\|^2}{T}\,,\dfrac{6\sigma\|r_0\|}{\sqrt{MT}}\,, 6\biggl(\dfrac{L_R\sigma^2\|r_0\|^4}{TK}\biggr)^{1/3}\biggr\} \\ &\qquad\qquad+\dfrac{6\sigma\|r_0\|}{\sqrt{MT}}+ 12\biggl(\dfrac{L_R\sigma^2\|r_0\|^4}{TK}\biggr)^{1/3} \\ &\qquad=\mathcal{O}\biggl(\dfrac{L\|r_0\|^2}{T}+\dfrac{\sigma\|r_0\|}{\sqrt{MT}} +\biggl(\dfrac{\varepsilon L\sigma^2\|r_0\|^4}{TK}\biggr)^{1/3}\biggr). \end{aligned} \end{equation*} \notag $$
This finalizes the proof of Theorem 2. $\Box$
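
For illustration, the stepsize rule used in this proof amounts to taking the minimum of the candidate values in the displayed formula. A minimal sketch is given below; the function name and the numerical inputs in the example call are hypothetical.

    # Stepsize rule from the proof of Theorem 2: the minimum of 1/(6L),
    # ||r_0|| sqrt(M)/(sigma sqrt(T)) and, when H > 1 and M > 1,
    # (||r_0||^2 / (L_R sigma^2 T H))^(1/3). All inputs below are placeholders.
    def theorem2_stepsize(L, L_R, r0_norm, sigma, T, M, H):
        candidates = [1.0 / (6 * L), r0_norm * M ** 0.5 / (sigma * T ** 0.5)]
        if H > 1 and M > 1:
            candidates.append((r0_norm ** 2 / (L_R * sigma ** 2 * T * H)) ** (1.0 / 3))
        return min(candidates)

    print(theorem2_stepsize(L=2.0, L_R=0.2, r0_norm=1.0, sigma=1.0, T=10_000, M=8, H=10))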

Appendix G. Proof of Theorem 3

We start by substituting the result of Lemma 9 into Lemma 8:

$$ \begin{equation*} \begin{aligned} \, \dfrac{\gamma_{t}}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant (1-\gamma_{t}\mu)\mathsf{E}\|r_t\|^2+ \gamma_{t}^2\mathsf{E}\|\overline{\boldsymbol{g}}_t-\overline{g}_t\|^2- \mathsf{E}\|r_{t+1}\|^2+2\gamma_{t}L_R\mathsf{E}[V_t] \\ &\leqslant (1-\gamma_{t}\mu)\mathsf{E}\|r_t\|^2-\mathsf{E}\|r_{t+1}\|^2+ 2\gamma_{t}L_R\mathsf{E}[V_t] \\ &\qquad+\gamma_{t}^2\biggl(\dfrac{\sigma^2}{M}+ \dfrac{\rho L^2}{M}\,\mathsf{E}[V_t]+ \dfrac{\rho L^2}{M}\,\mathsf{E}\|r_t\|^2\biggr). \end{aligned} \end{equation*} \notag $$
Suppose that $\gamma_{t}$ satisfies the assumption of Lemma 10 for case (d):
$$ \begin{equation*} \gamma_t \leqslant \min\biggl\{\frac{\mu}{3\rho L^2},\frac{1}{6L}\biggr\}. \end{equation*} \notag $$
Therefore, for $t=kH+a$ we have
$$ \begin{equation} \begin{aligned} \, &\dfrac{\gamma_{t}}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] \nonumber \\ &\quad\leqslant (1-\gamma_{t}\mu)\mathsf{E}\|r_t\|^2-\mathsf{E}\|r_{t+1}\|^2+ 2\gamma_{t}L_R\mathsf{E}[V_t]+\gamma_{t}^2\biggl(\dfrac{\sigma^2}{M}+ \dfrac{\rho L^2}{M}V_t+\dfrac{\rho L^2}{M}\,\mathsf{E}\|r_t\|^2\biggr) \nonumber \\ &\quad=\biggl(1-\gamma_{t}\mu+\dfrac{\gamma^2_t\rho L^2}{M}\biggr)\mathsf{E} \|r_t\|^2-\mathsf{E}\|r_{t+1}\|^2+\dfrac{\gamma_{t}^2\sigma^2}{M}+ \biggl(\dfrac{\gamma_{t}^2\rho L^2}{M}+2\gamma_{t}L_R\biggr)\mathsf{E}[V_t] \nonumber \\ &\quad\leqslant\biggl(1-\dfrac{2\gamma_{t}\mu}{3}\biggr)\mathsf{E}\|r_t\|^2- \mathsf{E}\|r_{t+1}\|^2+\dfrac{\gamma_{t}^2\sigma^2}{M} +\biggl(\dfrac{\gamma_t\rho L^2}{M}+2L_R\biggr) \nonumber \\ &\quad\qquad\times\gamma_t\biggl(\,\sum_{i=kH}^{kH+a-1} \bigl(\rho\gamma_{i}^2 L^2\mathsf{E}\|r_i\|^2+\gamma_{i}^2\sigma^2\bigr) \prod_{j=i+1}^{kH+a-1}\biggl(1-\dfrac{\gamma_{j}\mu}{2}\biggr)\biggr), \end{aligned} \end{equation} \tag{38} $$
where we have used that $\gamma_t \leqslant \mu/(3\rho L^2)$. Next, assume that $\gamma_t \equiv \gamma$ and introduce some parameters for a convenient reformulation of (38):
$$ \begin{equation} \begin{aligned} \, \beta_1&:=1-\dfrac{2\gamma\mu}{3}\,, \\ \beta_2&:=1-\dfrac{\gamma\mu}{2}\,, \\ C_0&:=\dfrac{\gamma\rho L^2}{M}+2L_R\,, \\ C&:=\dfrac{\mu}{3M}+2L_R \geqslant C_0. \end{aligned} \end{equation} \tag{39} $$
As a consequence, using the notation (39) we obtain
$$ \begin{equation} \begin{aligned} \, \dfrac{1}{6}\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant \dfrac{\beta_1}{\gamma}\mathsf{E}\|r_t\|^2- \dfrac{1}{\gamma}\mathsf{E}\|r_{t+1}\|^2+\dfrac{\gamma\sigma^2}{M} \nonumber \\ &\qquad+C_0\sum_{i=kH}^{t-1}\bigl(\rho\gamma^2 L^2\mathsf{E}\|r_i\|^2+ \gamma^2 \sigma^2\bigr)\beta_2^{t-i -1}. \end{aligned} \end{equation} \tag{40} $$
The technique applied next is the same as in the proofs of the previous theorems; however, we first sum over only part of the iterations, namely over a single communication period. That is, assume that $t \in \{nH,\dots,(n+1)H-1\}$ for some $n \in \mathbb{N}$. Then, summing with the weights $w_t$, we have
$$ \begin{equation*} \begin{aligned} \, \sum_{t=nH}^{(n+1)H-1}\dfrac{w_t}{6}\,\mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant \sum_{t=nH}^{(n+1)H-1} \biggl(\dfrac{w_t\beta_1}{\gamma}\,\mathsf{E}\|r_t\|^2- \dfrac{w_t}{\gamma}\,\mathsf{E}\|r_{t+1}\|^2+\dfrac{w_t\gamma\sigma^2}{M} \\ &\qquad+C_0 w_t\sum_{i=nH}^{t-1}\bigl(\rho\gamma^2 L^2\mathsf{E} \|r_i\|^2+\gamma^2 \sigma^2\bigr)\beta_2^{t-i-1}\biggr). \end{aligned} \end{equation*} \notag $$
Let us find the coefficient of $\mathsf{E}\|r_p\|^2$ for $p \in \{nH,\dots,(n+1)H\}$ in the above inequality. We start with $p \in \{nH+1,\dots,(n+1)H-2\}$. Expanding the brackets, we see that this coefficient is equal to
$$ \begin{equation} \begin{aligned} \, \dfrac{w_p\beta_1}{\gamma}-\dfrac{w_{p-1}}{\gamma}+ C_0\rho\gamma^2L^2\beta_2^{-p-1}\sum_{k=p+1}^{(n+1)H-1}w_k\beta_2^k. \end{aligned} \end{equation} \tag{41} $$
Now let us choose $w_t:=(1-\gamma\mu/2)^{-(t+1)}$. Then this coefficient can be bounded above by
$$ \begin{equation} \begin{aligned} \, \dfrac{w_p\beta_1}{\gamma}-\dfrac{w_{p-1}}{\gamma}+ \dfrac{C\rho \gamma^2L^2Hw_p}{1-\gamma\mu/2} \end{aligned} \end{equation} \tag{42} $$
since $(n+1)H-p-1 \leqslant H$. To make this bound non-positive, we multiply it by $(1-\gamma\mu/2)^{p+2}$ and make the substitutions (39), which leads to the requirement
$$ \begin{equation*} \dfrac{1}{\gamma}\biggl(-\dfrac{\gamma\mu}{6}\biggl(1- \dfrac{\gamma\mu}{2}\biggr)\biggr)+C\rho \gamma^2L^2H \leqslant 0. \end{equation*} \notag $$
Solving this quadratic inequality in $\gamma$ and applying the fact that $\sqrt{a^2+b^2} \leqslant a+b$ for $a,b \geqslant 0$, one can obtain
$$ \begin{equation} \begin{aligned} \, \gamma \leqslant \dfrac{\sqrt{\mu}}{\sqrt{6CH\rho L^2}}\,. \end{aligned} \end{equation} \tag{43} $$
If we choose $\gamma$ in accordance with (43), then for all $n$ and all $p \in \{nH+1,\dots,(n+1)H-2\}$ the coefficient of $\mathsf{E}\|r_p\|^2$ is non-positive. Moreover, the case $p=(n+1)H-1$ is obvious since then the coefficient is equal to
$$ \begin{equation*} \dfrac{w_p\beta_1}{\gamma}-\dfrac{w_{p-1}}{\gamma}\,. \end{equation*} \notag $$
This expression is non-positive, since it is the bound (42) without its last, non-negative, term. The last case is when $p$ is a moment of communication. For the final convergence bound it is necessary to sum the terms over all periods between communications, and therefore for $p \equiv 0 \pmod H$ the coefficient of $\mathsf{E}\|r_p\|^2$ is equal to the sum of two terms: one appears when summing from $nH$ to $(n+1)H-1$, and the other when summing from $(n-1)H$ to $nH-1$. Their sum is equal to (41) for $p=nH$, so by the bound (42) this coefficient is also non-positive. Consequently, for $T=KH$ we obtain
$$ \begin{equation*} \begin{aligned} \, &\sum_{n=0}^{K-1}\,\sum_{t=nH}^{(n+1)H-1}\dfrac{w_t}{6}\,\mathsf{E} [F(\overline{x}_t)-F(x_*)]\\ &\qquad \leqslant \|r_0\|^2\biggl(\dfrac{w_0}{\gamma}\beta_1+C\rho\gamma^2 L^2w_0 \sum_{k=1}^{H-1}\beta_2^kw_k\biggr) -\mathsf{E}\|r_T\|^2\cdot\dfrac{w_{T-1}}{\gamma} \\ &\qquad\qquad+\sum_{n=0}^{K-1}\,\sum_{t=nH}^{(n+1)H-1}C_0w_t\gamma^2\sigma^2 \sum_{i=nH}^{t-1}\beta_2^{t-i-1} +\sum_{t=0}^{T-1}\dfrac{w_t\gamma\sigma^2}{M} \\ &\qquad\leqslant \|r_0\|^2\biggl(\dfrac{w_0}{\gamma}\beta_1+ C\rho\gamma^2 L^2w_0\sum_{k=1}^{H-1}\beta_2^kw_k\biggr) \\ &\qquad\qquad+\sum_{t=0}^{T-1}\biggl(\dfrac{w_t\gamma\sigma^2}{M}+ C_0w_tH\gamma^2\sigma^2\biggr). \end{aligned} \end{equation*} \notag $$
Therefore, using that $\beta_1 \leqslant \beta_2$ and $\beta_2^k w_k=\dfrac{1}{1-\gamma\mu/2}$ , we have
$$ \begin{equation} \begin{aligned} \, \sum_{n=0}^{K-1}\,\sum_{t=nH}^{(n+1)H-1}\dfrac{w_t}{6}\, \mathsf{E}[F(\overline{x}_t)-F(x_*)] &\leqslant \|r_0\|^2\biggl(\dfrac{1}{\gamma}+ \dfrac{CH\rho\gamma^2 L^2w_0}{1-\gamma\mu/2}\biggr) \nonumber \\ &\qquad+\sum_{t=0}^{T-1} \biggl(\dfrac{w_t\gamma\sigma^2}{M}+ C_0w_tH\gamma^2\sigma^2\biggr) \nonumber \\ &\leqslant \|r_0\|^2\biggl(\dfrac{1}{\gamma}+\dfrac{24\mu}{121}\biggr) \nonumber \\ &\qquad+\sum_{t=0}^{T-1}\biggl(\dfrac{w_t\gamma\sigma^2}{M}+ C_0w_tH\gamma^2\sigma^2\biggr) \end{aligned} \end{equation} \tag{44} $$
since $\gamma \leqslant \dfrac{1}{6L}$ and $\gamma \leqslant \dfrac{\sqrt{\mu}}{\sqrt{6CH\rho L^2}}$. Dividing both sides by $W_T=\displaystyle\sum_{t=0}^{T-1}w_t$ and applying Jensen’s inequality (7) to (44) with $\widetilde{x}_T:=\dfrac{1}{W_T}\displaystyle\sum_{t=0}^{T-1}w_t\overline{x}_t$, one can obtain
$$ \begin{equation*} \mathsf{E}[F(\widetilde{x}_T)-F(x_*)] \leqslant \dfrac{6\|r_0\|^2}{W_T}\biggl(\dfrac{1}{\gamma}+\dfrac{24\mu}{121}\biggr)+ \dfrac{6\gamma\sigma^2}{M}+6C_0H\gamma^2\sigma^2. \end{equation*} \notag $$
Choosing $\gamma$ to be
$$ \begin{equation*} \min\biggl\{\dfrac{\mu}{3\rho L^2}\,,\dfrac{1}{6L}\,, \dfrac{\sqrt{\mu}}{\sqrt{6CH\rho L^2}}\,, \dfrac{\log(\max\{2,\mu^2\|r_0\|^2T^2M/\sigma^2\})}{2\mu T}\biggr\} \end{equation*} \notag $$
and using that
$$ \begin{equation*} W_T \geqslant w_{T-1}=\biggl(1-\dfrac{\gamma\mu}{2}\biggr)^{-T}\geqslant \exp\biggl(\dfrac{\gamma\mu T}{2}\biggr), \end{equation*} \notag $$
we see that if $\gamma$ is equal to the first, second, or third option, then
$$ \begin{equation*} \begin{aligned} \, \mathsf{E}[F(\widetilde{x}_T)-F(x_*)]&=\widetilde{\mathcal{O}}\biggl(\|r_0\|^2 \max\biggl\{\dfrac{\rho L^2}{\mu}\,,L,\dfrac{\sqrt{(\mu+L_R)H\rho L^2}} {\sqrt{\mu}}\biggr\} \\ &\qquad\times\exp\biggl(-\mu T\min\biggl\{\dfrac{\mu}{\rho L^2}\,, \dfrac{1}{L}\,,\dfrac{\sqrt{\mu}}{\sqrt{(\mu+L_R)H\rho L^2}}\biggr\}\biggr) \\ &\qquad+\dfrac{\sigma^2}{\mu M T}+\dfrac{\rho L^2H\sigma^2}{M\mu^3T^3}+ \dfrac{L_RH\sigma^2}{\mu^2 T^2}\biggr) \end{aligned} \end{equation*} \notag $$
(we have substituted in $C_0$ from (39)). Moreover, the fourth choice of $\gamma$ gives
$$ \begin{equation*} \mathsf{E}[F(\widetilde{x}_T)-F(x_*)]= \widetilde{\mathcal{O}}\biggl(\dfrac{\sigma^2}{\mu M T}+ \dfrac{\rho L^2H\sigma^2}{M\mu^3T^3}+\dfrac{L_RH\sigma^2}{\mu^2 T^2}\biggr). \end{equation*} \notag $$
Combining the above two upper bounds we finish the proof of Theorem 3. $\Box$
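
For illustration, the stepsize used in the proof of Theorem 3 is again the minimum of several candidates, now involving the constant $C=\mu/(3M)+2L_R$ from (39). The sketch below evaluates this rule; the function name and the numerical inputs are hypothetical placeholders.

    # Stepsize rule from the proof of Theorem 3: the minimum of mu/(3 rho L^2),
    # 1/(6L), sqrt(mu)/sqrt(6 C H rho L^2) with C = mu/(3M) + 2 L_R, and
    # log(max{2, mu^2 ||r_0||^2 T^2 M / sigma^2}) / (2 mu T). Placeholders only.
    import math

    def theorem3_stepsize(L, L_R, mu, rho, r0_norm, sigma, T, M, H):
        C = mu / (3 * M) + 2 * L_R
        return min(
            mu / (3 * rho * L ** 2),
            1.0 / (6 * L),
            math.sqrt(mu) / math.sqrt(6 * C * H * rho * L ** 2),
            math.log(max(2.0, mu ** 2 * r0_norm ** 2 * T ** 2 * M / sigma ** 2)) / (2 * mu * T),
        )

    print(theorem3_stepsize(L=2.0, L_R=0.2, mu=0.5, rho=0.3, r0_norm=1.0,
                            sigma=1.0, T=10_000, M=8, H=10))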


Bibliography

1. M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and Li Zhang, “Deep learning with differential privacy”, Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, ACM, New York, 2016, 308–318
2. D. Basu, D. Data, C. Karakus, and S. N. Diggavi, “Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations”, NIPS'19: Proceedings of the 33rd international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 32, Curran Associates, Inc., Red Hook, NY, 2019, 1316, 14695–14706 https://proceedings.neurips.cc/paper_files/paper/2019/hash/d202ed5bcfa858c15a9f383c3e386ab2-Abstract.html
3. A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133 https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html
4. A. Beznosikov, V. Samokhin, and A. Gasnikov, Distributed saddle-point problems: lower bounds, near-optimal and robust algorithms, 2022 (v1 – 2020), 52 pp., arXiv: 2010.13112
5. A. Beznosikov, G. Scutari, A. Rogozin, and A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184 https://proceedings.neurips.cc/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html
6. A. Beznosikov, M. Takác, and A. Gasnikov, “Similarity, compression and local steps: three pillars of efficient communications for distributed variational inequalities”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1246, 28663–28677 https://proceedings.neurips.cc/paper_files/paper/2023/hash/5b4a459db23e6db9be2a128380953d96-Abstract-Conference.html
7. S. Chezhegov, S. Skorik, N. Khachaturov, D. Shalagin, A. Avetisyan, A. Beznosikov, M. Takáč, Y. Kholodov, and A. Beznosikov, Local methods with adaptivity via scaling, 2024, 41 pp., arXiv: 2406.00846
8. S. Ghadimi and Guanghui Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. II. Shrinking procedures and optimal algorithms”, SIAM J. Optim., 23:4 (2013), 2061–2089
9. M. R. Glasgow, Honglin Yuan, and Tengyu Ma, “Sharp bounds for federated averaging (local SGD) and continuous perspective”, Proceedings of the 25th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 151, 2022, 9050–9090 https://proceedings.mlr.press/v151/glasgow22a
10. I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, Adapt. Comput. Mach. Learn., MIT Press, Cambridge, MA, 2016, xxii+775 pp.
11. E. Gorbunov, F. Hanzely, and P. Richtárik, “Local SGD: unified theory and new efficient methods”, Proceedings of the 24th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 130, 2021, 3556–3564 https://proceedings.mlr.press/v130/gorbunov21a.html
12. H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, and L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227 https://proceedings.mlr.press/v119/hendrikx20a.html
13. P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated learning”, Found. Trends Mach. Learn., 14:1-2 (2021), 1–210
14. S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5132–5143 https://proceedings.mlr.press/v119/karimireddy20a.html
15. A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529 https://proceedings.mlr.press/v108/bayoumi20a.html
16. A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393 https://proceedings.mlr.press/v119/koloskova20a.html
17. J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated learning: strategies for improving communication efficiency, 2017 (v1 – 2016), 10 pp., arXiv: 1610.05492
18. D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, and G. Scutari, “Optimal gradient sliding and its application to optimal distributed optimization under similarity”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2427, 33494–33507 https://proceedings.neurips.cc/paper_files/paper/2022/hash/d88f6f81e1aaf606776ffdd06fdf24ef-Abstract-Conference.html
19. Tian Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks”, Proc. Mach. Learn. Syst., 2 (2020), 429–450
20. Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng, Variance reduced local SGD with lower communication complexity, 2019, 25 pp., arXiv: 1912.12844
21. L. O. Mangasarian, “Parallel gradient distribution in unconstrained optimization”, SIAM J. Control Optim., 33:6 (1995), 1916–1925
22. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282 https://proceedings.mlr.press/v54/mcmahan17a
23. K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtárik, “Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally!”, Proceedings of the 39th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 162, 2022, 15750–15769 https://proceedings.mlr.press/v162/mishchenko22b
24. A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization”, Proceedings of the 23rd international conference on artificial intelligence and statistics, Proc. Mach. Learn. Res. (PMLR), 108, 2020, 2021–2031 https://proceedings.mlr.press/v108/reisizadeh20a.html
25. H. Robbins and S. Monro, “A stochastic approximation method”, Ann. Math. Statistics, 22 (1951), 400–407
26. M. Schmidt and N. Le Roux, Fast convergence of stochastic gradient descent under a strong growth condition, 2013, 5 pp., arXiv: 1308.6370
27. S. Shalev-Shwartz and S. Ben-David, Understanding machine learning. From theory to algorithms, Cambridge Univ. Press, Cambridge, 2014, xvi+397 pp.
28. O. Shamir, N. Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008 https://proceedings.mlr.press/v32/shamir14.html
29. P. Sharma, S. Kafle, P. Khanduri, S. Bulusu, K. Rajawat, and P. K. Varshney, Parallel restarted SPIDER – communication efficient distributed nonconvex optimization with optimal computation complexity, 2020 (v1 – 2019), 25 pp., arXiv: 1912.06036
30. A. Spiridonoff, A. Olshevsky, and Y. Paschalidis, “Communication-efficient SGD: from local SGD to one-shot averaging”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 1861, 24313–24326 https://proceedings.neurips.cc/paper_files/paper/2021/hash/cc06a6150b92e17dd3076a0f0f9d2af4-Abstract.html
31. S. U. Stich, Local SGD converges fast and communicates little, 2019 (v1 – 2018), 19 pp., arXiv: 1805.09767
32. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 1–33
33. Jianyu Wang, V. Tantia, N. Ballas, and M. Rabbat, SlowMo: Improving communication-efficient distributed SGD with slow momentum, 2020 (v1 – 2019), 27 pp., arXiv: 1910.00643
34. B. Woodworth, K. K. Patel, S. Stich, Zhen Dai, B. Bullins, B. Mcmahan, O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?”, Proceedings of the 37th international conference on machine learning, Proc. Mach. Learn. Res. (PMLR), 119, 2020, 10334–10343 https://proceedings.mlr.press/v119/woodworth20a
35. Honglin Yuan and Tengyu Ma, “Federated accelerated stochastic gradient descent”, NIPS'20: Proceedings of the 34th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 33, Curran Associates, Inc., Red Hook, NY, 2020, 448, 5332–5344 https://proceedings.neurips.cc/paper_files/paper/2020/hash/39d0a8908fbe6c18039ea8227f827023-Abstract.html
36. Jian Zhang, C. De Sa, I. Mitliagkas, and C. Ré, Parallel SGD: when does averaging help?, 2016, 13 pp., arXiv: 1606.07365
