Gradient Descent Optimization


In many machine learning algorithm, the goal is to find a function or parameters that allows us to approximate or modelize unknown observable data. Those data could come from device measurement, web crawling, empirical observations etc. Generally speaking we have n samples \boldsymbol y = (\boldsymbol{y_1}, \boldsymbol{y_2}..., \boldsymbol{y_n}) of the observation vector \boldsymbol{y_i} = (y_i^{(1)}, y_i^{(2)} ..., y_i^{(m)}). For example such a vector \boldsymbol y_i could be the coordinates (x, y, z) of an object in space.  We want to approximate these data with some parameters \boldsymbol\theta_e through a known function h so that

\boldsymbol h_{\theta_e}(\boldsymbol x) \approx \boldsymbol y

One common way to do that is trying to minimize an error function, for example the root mean square (RMS)

J(\boldsymbol{\theta})=\frac{1}{2}RMS = \frac{1}{2m}\sum \limits_{i=1}^m \bigg(h_\theta(\boldsymbol {x}^{(i)}) -y^{(i)}\bigg)^2


 Linear Regression Example

Without loss of generality, we'll focus on the simple case of linear regression. For example, imagine you have a dataset  of (x, y) points and you want to fit a line


The regression function have then the form of

h_{\theta}(\boldsymbol {x})=\theta_0+\theta_1x

but more generally for a linear regression this function will have the form of

h_{\theta}(\boldsymbol {x})=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n

where \boldsymbol x=(x_1, x_2..., x_n) is the input vector, the vector on which we want to regress, the observable variables, and \boldsymbol \theta=(\theta_0, \theta_1..., \theta_n) are the regression parameters. To be more generic, we can also define vector \boldsymbol x to be \boldsymbol x=(x_0, x_1, x_2..., x_n) with x_0 = 1 so that the regression function become now

h_{\theta}(\boldsymbol {x})=\theta^T\boldsymbol {x}

We are looking for the vector \boldsymbol{\theta} that minimizes the error function J(\boldsymbol{\theta}). The error function is continuous and differentiable so that at the minimal error, we have

\nabla J(\boldsymbol{\theta}) = 0

As there is no analytic solution of this optimization problem in the general case, one technic is to iteratively update the weights in the direction of the gradient

\boldsymbol{\theta}^{t+1} = \boldsymbol\theta^t - \alpha\nabla J(\boldsymbol\theta)

where \alpha is the learning rate. The learning rate has to be small enough not to pass trough the minimum or oscillate around it, but large enough not to converge rapidely. In the special case of linear regression with a polynom of order 2, this takes the form

\theta_j^{t+1} = \theta_j^t - \alpha\frac{1}{m}\sum_{i=1}^m \big(h_\theta(\boldsymbol{x}^{(i)}) -y^{(i)}\big)x_j^{(i)}

Practically, we don't look for \boldsymbol\theta that satisfies J(\boldsymbol\theta)=0, but we iterate over t as long as the vector \boldsymbol\theta significantly changes, or we can also fix the number of iterations. I propose here an implementation of such an algorithm in Python


Leave a Reply

%d bloggers like this: