Updated on October 31, 2015

# Gradient Descent Optimization

In many machine learning algorithms, the goal is to find a function or a set of parameters that approximates or models observed data. These data could come from device measurements, web crawling, empirical observations, etc. Generally speaking, we have $m$ samples of an observation vector $\mathbf{y} = (y_1, \dots, y_m)$. For example, such a vector could be the coordinates of an object in space. We want to approximate these data with some parameters $\theta$ through a known function $h_\theta$ so that

$$y_i \approx h_\theta(\mathbf{x}_i), \quad i = 1, \dots, m$$
One common way to do this is to minimize an error function, for example the mean squared error (MSE), here with a convenient factor of $\frac{1}{2}$:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}_i) - y_i \right)^2$$
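As a quick sanity check, the cost above can be computed directly in NumPy. This is a minimal sketch with a hypothetical toy dataset (three points on the line $y = 2x$, so $\theta = (0, 2)$ gives zero cost):

```python
import numpy as np

# Hypothetical toy data: three points on the line y = 2x.
X = np.column_stack((np.ones(3), np.array([1.0, 2.0, 3.0])))  # bias column + x
y = np.array([2.0, 4.0, 6.0])

def half_mse(X, y, theta):
    """Half mean squared error: J(theta) = 1/(2m) * sum((X.theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y
    return np.sum(residuals ** 2) / (2 * m)

print(half_mse(X, y, np.array([0.0, 2.0])))  # exact fit, so J = 0
print(half_mse(X, y, np.array([0.0, 0.0])))  # J = (4 + 16 + 36) / (2 * 3)
```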
# Linear Regression Example

Without loss of generality, we'll focus on the simple case of linear regression. For example, imagine you have a dataset of points $(x_i, y_i)$ and you want to fit a line through them.

The regression function then has the form

$$h_\theta(x) = \theta_0 + \theta_1 x$$
but more generally, for a linear regression, this function has the form

$$h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
where $\mathbf{x} = (x_1, \dots, x_n)$ is the input vector (the variables on which we want to regress), $y$ is the observable variable, and $\theta = (\theta_0, \theta_1, \dots, \theta_n)$ are the regression parameters. To be more generic, we can also extend the vector $\mathbf{x}$ to $(x_0, x_1, \dots, x_n)$ with $x_0 = 1$, so that the regression function now becomes

$$h_\theta(\mathbf{x}) = \sum_{j=0}^{n} \theta_j x_j = \theta^\top \mathbf{x}$$
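The $x_0 = 1$ trick can be sketched in NumPy by stacking a column of ones onto the inputs, so the bias $\theta_0$ is handled by the same dot product as the other weights (a minimal sketch; the variable names are my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])              # raw 1-D inputs
X = np.column_stack((np.ones(len(x)), x))  # prepend x_0 = 1 for the bias

theta = np.array([2.0, 3.0])  # theta_0 = 2 (bias), theta_1 = 3
h = X @ theta                 # h_theta(x) = theta^T x for every sample at once
print(h)                      # values: 5, 8, 11, i.e. 2 + 3 * x_i
```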
We are looking for the vector $\theta^*$ that minimizes the error function $J$. The error function is continuous and differentiable, so at the minimal error we have

$$\nabla J(\theta^*) = \mathbf{0}, \quad \text{i.e.} \quad \frac{\partial J}{\partial \theta_j}(\theta^*) = 0 \quad \text{for all } j$$
As there is no analytic solution to this optimization problem in the general case, one technique is to iteratively update the weights in the direction opposite to the gradient

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}_i) - y_i \right) x_{i,j}$$
where $\alpha$ is the learning rate. The learning rate has to be small enough not to overshoot the minimum or oscillate around it, but large enough to converge reasonably fast. In the special case of fitting a line (two parameters, $\theta_0$ and $\theta_1$), this takes the form

$$\theta_0 \leftarrow \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right), \qquad \theta_1 \leftarrow \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i$$
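A single update step of this rule can be written out directly. This is only an illustrative sketch on noiseless data of my own choosing, to make the two formulas concrete:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 * x + 2                  # noiseless line, for illustration only
alpha, m = 0.1, len(y)

theta0, theta1 = 0.0, 0.0
h = theta0 + theta1 * x        # current predictions h_theta(x_i)

# Simultaneous update of both parameters, as in the two equations above
theta0, theta1 = (theta0 - alpha / m * np.sum(h - y),
                  theta1 - alpha / m * np.sum((h - y) * x))
print(theta0, theta1)          # theta0 ~ 0.65, theta1 ~ 1.35 after one step
```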
In practice, we don't look for the exact $\theta^*$ that satisfies $\nabla J(\theta^*) = \mathbf{0}$; instead we iterate as long as the vector $\theta$ changes significantly, or we simply fix the number of iterations. I propose here an implementation of such an algorithm in Python:

```python
import numpy as np
import matplotlib.pyplot as plt


def AddUnityVector(X):
    m = np.array(X).shape[0]
    X = np.column_stack((np.ones(m), X))
    return X


def Normalize(X):
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    X = (X - mu) / sigma
    return X


def ComputeCost(X, y, theta):
    m = len(y)
    h = np.dot(X, theta)
    J = np.sum((h - y) ** 2) / (2 * m)
    return J


def ProcessGradient(X, y, theta, alpha):
    m = len(y)
    n = len(theta)
    factor = alpha / m
    h = np.dot(X, theta)
    theta = [theta[i] - factor * np.sum((h - y) * X[:, i]) for i in range(n)]
    return theta


def ComputeGradientDescent(x, y, n_iter):
    alpha = 0.01
    # Add unity vector for the bias term
    x = AddUnityVector(x)
    # First guess for theta: all zeros
    s = x.shape
    m = s[0]
    n = 1 if len(s) == 1 else s[1]
    theta = np.zeros(n)
    for i in range(n_iter):
        theta = ProcessGradient(x, y, theta, alpha)
    return theta


if __name__ == '__main__':
    x = np.linspace(0, 10, 100)
    y = 3 * x + 2 + x * np.random.randn(100)
    theta = ComputeGradientDescent(x, y, 1500)

    # Plot the results
    x = AddUnityVector(x)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(x[:, 1], y, 'b.')
    h = np.dot(x, theta)
    data = np.column_stack((x[:, 1], h))
    # Sort by x for a clean line plot
    data = data[data[:, 0].argsort()]
    ax.plot(data[:, 0], data[:, 1], 'r')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    fig.suptitle('y=3x+2 approximated with y=%.2fx+%.2f' % (theta[1], theta[0]),
                 fontsize=14)
    plt.show()
```
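The implementation above fixes the iteration count. The other stopping criterion mentioned earlier, iterating as long as $\theta$ still changes significantly, can be sketched as a variant (this is my own sketch, not the code above; the tolerance `tol` and the vectorized update are assumptions):

```python
import numpy as np

def gradient_descent_tol(X, y, alpha=0.01, tol=1e-7, max_iter=100000):
    """Run gradient descent until theta changes by less than tol (L2 norm)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(max_iter):
        h = X @ theta
        new_theta = theta - alpha / m * (X.T @ (h - y))  # vectorized update
        if np.linalg.norm(new_theta - theta) < tol:      # convergence check
            return new_theta
        theta = new_theta
    return theta

x = np.linspace(0, 10, 100)
X = np.column_stack((np.ones(len(x)), x))    # bias column, as before
theta = gradient_descent_tol(X, 3 * x + 2)   # noiseless data for illustration
print(theta)  # should approach [2, 3]
```

This trades a predictable runtime for a result whose precision is controlled by `tol` rather than by guessing how many iterations are enough.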
