# Gradient Descent Optimization

In many machine learning algorithm, the goal is to find a function or parameters that allows us to approximate or modelize unknown observable data. Those data could come from device measurement, web crawling, empirical observations etc. Generally speaking we have $n$ samples $\boldsymbol y = (\boldsymbol{y_1}, \boldsymbol{y_2}..., \boldsymbol{y_n})$ of the observation vector $\boldsymbol{y_i} = (y_i^{(1)}, y_i^{(2)} ..., y_i^{(m)})$. For example such a vector $\boldsymbol y_i$ could be the coordinates $(x, y, z)$ of an object in space.  We want to approximate these data with some parameters $\boldsymbol\theta_e$ through a known function $h$ so that

$\boldsymbol h_{\theta_e}(\boldsymbol x) \approx \boldsymbol y$

One common way to do that is trying to minimize an error function, for example the root mean square (RMS)

$J(\boldsymbol{\theta})=\frac{1}{2}RMS = \frac{1}{2m}\sum \limits_{i=1}^m \bigg(h_\theta(\boldsymbol {x}^{(i)}) -y^{(i)}\bigg)^2$

# Linear Regression Example

Without loss of generality, we'll focus on the simple case of linear regression. For example, imagine you have a dataset  of $(x, y)$ points and you want to fit a line

The regression function have then the form of

$h_{\theta}(\boldsymbol {x})=\theta_0+\theta_1x$

but more generally for a linear regression this function will have the form of

$h_{\theta}(\boldsymbol {x})=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n$

where $\boldsymbol x=(x_1, x_2..., x_n)$ is the input vector, the vector on which we want to regress, the observable variables, and $\boldsymbol \theta=(\theta_0, \theta_1..., \theta_n)$ are the regression parameters. To be more generic, we can also define vector $\boldsymbol x$ to be $\boldsymbol x=(x_0, x_1, x_2..., x_n)$ with $x_0 = 1$ so that the regression function become now

$h_{\theta}(\boldsymbol {x})=\theta^T\boldsymbol {x}$

We are looking for the vector $\boldsymbol{\theta}$ that minimizes the error function $J(\boldsymbol{\theta})$. The error function is continuous and differentiable so that at the minimal error, we have

$\nabla J(\boldsymbol{\theta}) = 0$

As there is no analytic solution of this optimization problem in the general case, one technic is to iteratively update the weights in the direction of the gradient

$\boldsymbol{\theta}^{t+1} = \boldsymbol\theta^t - \alpha\nabla J(\boldsymbol\theta)$

where $\alpha$ is the learning rate. The learning rate has to be small enough not to pass trough the minimum or oscillate around it, but large enough not to converge rapidely. In the special case of linear regression with a polynom of order 2, this takes the form

$\theta_j^{t+1} = \theta_j^t - \alpha\frac{1}{m}\sum_{i=1}^m \big(h_\theta(\boldsymbol{x}^{(i)}) -y^{(i)}\big)x_j^{(i)}$

Practically, we don't look for $\boldsymbol\theta$ that satisfies $J(\boldsymbol\theta)=0$, but we iterate over $t$ as long as the vector $\boldsymbol\theta$ significantly changes, or we can also fix the number of iterations. I propose here an implementation of such an algorithm in Python