"Adagrad has the natural effect of decreasing the effective step size as a function of time. Perhaps you have good reason to use your own step-size decrease schedule. In this case, you can use a running average of the historical gradient instead of a sum."[1]
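The difference between the two accumulators can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the `decay` and `eps` parameters, and the choice of an exponential running average are all assumptions, since the quote does not say which kind of running average is meant.

```python
import math

def adagrad_scale(grads, eps=1e-8):
    """Step-size multiplier 1/sqrt(sum of squared gradients) at each step."""
    total = 0.0
    scales = []
    for g in grads:
        total += g * g  # the sum only grows, so the multiplier only shrinks
        scales.append(1.0 / (math.sqrt(total) + eps))
    return scales

def running_average_scale(grads, decay=0.9, eps=1e-8):
    """Same multiplier, but with an exponential running average (assumed form)."""
    avg = 0.0
    scales = []
    for g in grads:
        avg = decay * avg + (1.0 - decay) * g * g  # old gradients are forgotten
        scales.append(1.0 / (math.sqrt(avg) + eps))
    return scales

# Constant gradient of 1.0 for 100 steps.
grads = [1.0] * 100
sum_scales = adagrad_scale(grads)
avg_scales = running_average_scale(grads)
```

With a constant gradient, the sum-based multiplier keeps falling like 1/sqrt(t), while the running-average multiplier levels off near 1. Any decrease over time then has to come from a separate step-size schedule.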

AdaGrad and L1-regularization

From "Notes on AdaGrad" (Chris Dyer):

Directly applying stochastic subgradient descent to an l1 regularized objective fails to produce sparse solutions in bounded time, which has motivated several specialized algorithms that target such objectives. We will use the AdaGrad variant of one such learning algorithm, the so-called regularized dual averaging algorithm of Xiao (2010), although other approaches are possible.
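To see why this variant yields exact zeros, here is a minimal sketch of one per-coordinate update. It assumes the closed form usually given for AdaGrad with regularized dual averaging and an l1 penalty: x_i = sign(-u_i) * (eta * t / sqrt(G_i)) * max(|u_i| / t - lam, 0), where u_i is the running sum of gradients and G_i the running sum of squared gradients. That formula is not stated in the passage above, and the function name, `eta`, `lam`, and the toy gradient stream are hypothetical.

```python
import math

def adagrad_rda_l1_step(u, G, g, t, eta=1.0, lam=0.1):
    """Update the running sums u and G in place and return the new weights.

    u: running sum of gradients, G: running sum of squared gradients,
    g: current gradient, t: 1-based step count. Assumed update form.
    """
    x = []
    for i in range(len(g)):
        u[i] += g[i]
        G[i] += g[i] * g[i]
        # Soft threshold on the average gradient magnitude.
        shrunk = abs(u[i]) / t - lam
        if shrunk <= 0.0 or G[i] == 0.0:
            x.append(0.0)  # weight is set to exactly zero, giving sparsity
        else:
            direction = -math.copysign(1.0, u[i])
            x.append(direction * eta * t / math.sqrt(G[i]) * shrunk)
    return x

# Toy stream: coordinate 0 has a consistent gradient, coordinate 1 only noise.
u = [0.0, 0.0]
G = [0.0, 0.0]
stream = [[-1.0, 0.05], [-1.0, -0.05], [-1.0, 0.05], [-1.0, -0.05]]
for t, g in enumerate(stream, start=1):
    x = adagrad_rda_l1_step(u, G, g, t)
```

In this toy run the coordinate with a consistent gradient moves away from zero, while the coordinate whose average gradient stays below `lam` is held at exactly zero at every step. Plain stochastic subgradient descent would instead leave it at a small nonzero value.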
