We have previously presented a proposed implementation of the k-means algorithm on computed samples, and we saw that it clustered our samples quite well. The next step is to evaluate the probability that a new sample belongs to one cluster or another. To do so, we'll use logistic regression.

Logistic regression is a special case of the generalized linear model but, unlike linear regression with its least mean square (LMS) solution, it has no explicit closed-form solution. So instead of solving directly for the minimum of a squared-error cost function, the goal here is to maximize the likelihood function, which has to be done iteratively.

This is a binomial regression algorithm: it is often used to separate two distinct groups of observations and to estimate the probability that a new sample belongs to one group or the other. In our case we have several groups (exactly k clusters), so the regression must be done several times. For each cluster, we'll perform the logistic regression on that cluster versus all the other clusters, and we'll repeat the operation for each cluster. That's the purpose of this part of the code seen in our k-means post.
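The one-vs-all scheme described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the k-means post: the function names (`fit_binary`, `fit_one_vs_all`, `classify`), the learning rate, and the number of steps are all assumptions chosen for readability, and the likelihood is maximized with simple batch gradient ascent and no regularization.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """P(y = 1 | x) for one binary model; theta[0] is the intercept."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def fit_binary(samples, targets, rate=0.1, steps=2000):
    """Maximize the log-likelihood by batch gradient ascent (assumed settings)."""
    m = len(samples)
    theta = [0.0] * (len(samples[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(samples, targets):
            err = y - predict_proba(theta, x)  # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        theta = [t + rate * g / m for t, g in zip(theta, grad)]
    return theta

def fit_one_vs_all(samples, labels, k):
    """Train k binary models: cluster c (target 1) against all others (target 0)."""
    models = []
    for c in range(k):
        targets = [1.0 if label == c else 0.0 for label in labels]
        models.append(fit_binary(samples, targets))
    return models

def classify(models, x):
    """Return the most probable cluster index and the per-cluster probabilities."""
    probs = [predict_proba(theta, x) for theta in models]
    return probs.index(max(probs)), probs
```

Here `labels` would be the cluster indices produced by k-means, and `classify(models, new_sample)` gives, for each cluster, the estimated probability that the new sample belongs to it rather than to any other cluster.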

The core of this source code sample is the minimization of the cost function (the negative log-likelihood, so minimizing it amounts to maximizing the likelihood) in its regularized form, which is given by:
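In its usual textbook form, the regularized logistic regression cost is written as below. The notation is an assumption, not taken from the post: $m$ is the number of samples, $n$ the number of features, $\lambda$ the regularization strength, $y^{(i)} \in \{0, 1\}$ the target of sample $x^{(i)}$, and the intercept $\theta_0$ is left out of the penalty.

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}$$

A small Python sketch of this cost and its gradient, under the same assumptions (the function names and the `lam` parameter are illustrative), might look like this:

```python
import math

def hypothesis(theta, x):
    """Sigmoid of theta0 + theta1*x1 + ...; branches avoid overflow in exp."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def regularized_cost(theta, samples, targets, lam):
    """Negative mean log-likelihood plus an L2 penalty on theta[1:]."""
    m = len(samples)
    eps = 1e-12  # keeps log() away from 0 when a prediction saturates
    total = 0.0
    for x, y in zip(samples, targets):
        h = min(max(hypothesis(theta, x), eps), 1.0 - eps)
        total -= y * math.log(h) + (1.0 - y) * math.log(1.0 - h)
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])
    return total / m + penalty

def regularized_gradient(theta, samples, targets, lam):
    """Gradient of regularized_cost; the intercept theta[0] is not penalized."""
    m = len(samples)
    grad = [0.0] * len(theta)
    for x, y in zip(samples, targets):
        err = hypothesis(theta, x) - y
        grad[0] += err
        for j, xj in enumerate(x):
            grad[j + 1] += err * xj
    grad = [g / m for g in grad]
    for j in range(1, len(theta)):
        grad[j] += lam / m * theta[j]
    return grad
```

Any gradient-based minimizer can then be fed these two functions. With `theta` set to all zeros, every prediction is 0.5 and the cost equals `log(2)`, which is a handy sanity check.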

Finally, the result has to be displayed clearly, which is the purpose of the following function:
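As a rough idea of what such a display helper has to compute, here is a minimal sketch that assumes two-dimensional samples and a linear model per cluster, so that each boundary is the straight line where the predicted probability equals 0.5. The name `boundary_segment` and its parameters are hypothetical and are not the post's original function.

```python
def boundary_segment(theta, x_range, y_range):
    """End points of the line theta0 + theta1*x + theta2*y = 0.

    theta is (theta0, theta1, theta2); x_range and y_range are (min, max)
    pairs describing the plotting window.
    """
    t0, t1, t2 = theta
    x_min, x_max = x_range
    y_min, y_max = y_range
    if abs(t2) > 1e-12:
        # General case: solve for y at both ends of the x window.
        return ((x_min, -(t0 + t1 * x_min) / t2),
                (x_max, -(t0 + t1 * x_max) / t2))
    if abs(t1) > 1e-12:
        # Vertical boundary: x is constant, span the y window instead.
        x = -t0 / t1
        return ((x, y_min), (x, y_max))
    raise ValueError("theta does not define a boundary line")
```

Each returned pair of points can be handed to whatever plotting tool is in use to draw one boundary per cluster on top of the samples.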


The result of the boundary computation is given below:


Boundaries with the initial samples
Boundaries in the normalized computed clusters result


For further details, see the Wikipedia page on logistic regression.