We have previously presented an implementation of the k-means algorithm on generated samples, and we saw that it clustered our samples quite well. The next step is to evaluate the probability that a new sample belongs to one cluster or another. To do so, we'll use logistic regression. This algorithm is a special case of generalized linear model, but unlike linear regression it has no explicit closed-form solution to the least mean squares (LMS) problem. So instead of trying to minimize the cost/error function directly, the goal here is to maximize the likelihood function. Logistic regression is a binomial regression algorithm, often used to separate two distinct groups of observations and estimate the probability that a new sample belongs to one group or the other. In our case we have several groups (exactly k clusters), so the regression must be performed several times: for each cluster, we perform a logistic regression of that cluster vs. all the other clusters (one-vs-all), and we repeat the operation for each cluster. That's the meaning of this part of the code seen in our k-means post.

% Compute the boundaries by logistic regression
for var = 1:k
    % Reassign y values
    yt = (y ~= var);
    % Add polynomial features
    Xt = polynomialCombinaisons(X(:,1), X(:,2));
    % Initialize fitting parameters
    initial_theta = zeros(size(Xt, 2), 1);
    % Set regularization parameter lambda (you should vary this)
    lambda = 0.001;
    % Set options
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    % Optimize
    [theta] = fmincg(@(t)(costFunctionReg(t, Xt, yt, lambda)), ...
        initial_theta, options);
    % Plot boundary
    plotDecisionBoundary(theta, Xt, yt, findobj('Name','Result center'), Xmean, Xstd);
    plotDecisionBoundary(theta, Xt, yt, findobj('Name','Result frontier'), [0 0], [1 1]);
end
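To make the one-vs-all relabeling concrete, here is a minimal Python sketch (NumPy stands in for the Octave vector operations, and the cluster labels are made up for illustration):

```python
import numpy as np

# Hypothetical k-means labels for 7 samples, k = 3 clusters.
y = np.array([1, 2, 3, 1, 2, 3, 1])
k = 3

# One-vs-all: for each cluster, build a binary target vector,
# mirroring `yt = (y ~= var);` in the Octave loop above.
targets = {}
for var in range(1, k + 1):
    targets[var] = (y != var).astype(int)

print(targets[1])  # [0 1 1 0 1 1 0] -> 0 for cluster 1, 1 for the rest
```

Each binary target vector then feeds one logistic regression, so k regressions yield k decision boundaries.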

The core of this code sample is the minimization of the cost function in its regularized form, which is given by
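In symbols (using the same notation as the implementation: hypothesis $h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$, $m$ training examples, and the bias term $\theta_0$ left unregularized), the cost and its gradient are:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log h_\theta(x^{(i)})
          - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
          + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}
  + \frac{\lambda}{m}\, \theta_j \qquad (j \ge 1)
```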

function [J, grad] = costFunctionReg(theta, X, y, lambda)
% Compute cost and gradient for logistic regression with regularization
    m = length(y);                     % number of training examples
    h = sigmoid(X*theta) + 1e-15;      % small offset avoids log(0)
    thetaReg = [0; theta(2:end)];      % do not regularize the bias term
    J = (-y'*log(h) - (ones(m,1)-y)'*log(ones(m,1)-h)) / m ...
        + lambda * (thetaReg' * thetaReg) / (2*m);
    grad = X' * (h - y) / m + thetaReg * lambda / m;
end
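For reference, here is an equivalent Python sketch of the same regularized cost and gradient (the data below is made up; as in the Octave version, the bias term is excluded from regularization):

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_function_reg(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient."""
    m = len(y)
    h = sigmoid(X @ theta) + 1e-15       # small offset avoids log(0)
    theta_reg = np.r_[0, theta[1:]]      # do not penalize the bias term
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m \
        + lam * (theta_reg @ theta_reg) / (2 * m)
    grad = X.T @ (h - y) / m + lam * theta_reg / m
    return J, grad

# Tiny example: 3 samples, bias column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
J, grad = cost_function_reg(np.zeros(2), X, y, 0.001)
# At theta = 0, h = 0.5 everywhere, so J = -log(0.5) ~ 0.693
```

Passing both the cost and the gradient to the optimizer (here `fmincg`, with `'GradObj', 'on'`) is what lets it converge quickly without numerical differentiation.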

Finally, the result has to be displayed properly, which is the purpose of the following function

function plotDecisionBoundary(theta, X, y, figHandle, Xmean, Xstd)
% Plots the data points X and y into a figure with
% the decision boundary defined by theta
    figure(figHandle)
    hold on
    % Here is the grid range
    u = linspace(-2, 2, 50);
    v = linspace(-2, 2, 50);
    z = zeros(length(u), length(v));
    % Evaluate z = theta*x over the grid
    for i = 1:length(u)
        for j = 1:length(v)
            z(i,j) = mapFeature(u(i), v(j)) * theta;
        end
    end
    z = z';  % important to transpose z before calling contour
    % Plot z = 0
    contour(u*Xstd(1)+Xmean(1), v*Xstd(2)+Xmean(2), z, [0, 0], 'LineWidth', 2)
    hold off
end
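The grid evaluation at the heart of this function can be sketched in Python as follows (`map_feature` here is a hypothetical stand-in for the post's `mapFeature` polynomial expansion, and `theta` is made up):

```python
import numpy as np

def map_feature(u, v, degree=2):
    """Polynomial feature expansion up to `degree` (stand-in for mapFeature)."""
    feats = [u**i * v**j for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.array(feats)

theta = np.ones(6)   # made-up parameters for a degree-2 expansion (6 terms)

# Evaluate z = theta * x over a grid in normalized coordinates.
u = np.linspace(-2, 2, 50)
v = np.linspace(-2, 2, 50)
z = np.zeros((len(u), len(v)))
for i in range(len(u)):
    for j in range(len(v)):
        z[i, j] = map_feature(u[i], v[j]) @ theta
z = z.T              # transpose before contouring, as in the Octave code
```

The decision boundary is then the level set z = 0; in the Octave function, rescaling the grid by `Xstd` and `Xmean` maps it back from normalized coordinates to the original sample space.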

The results of the boundary computation are shown below.

[Figure: Boundaries with the initial samples]

[Figure: Boundaries in the normalized computed clusters]

For further details, see the Wikipedia page on logistic regression.