K-means clustering isn’t the best clustering algorithm out there. It isn’t highly accurate, because K-means assumes that all clusters are similar in size and spherical in shape. It also can’t handle complex geometry; it is really only suited to clusters with flat geometry. And outliers, if present, can have a major influence on the output clusters. Yet K-means has been popular for ages and is widely used in marketing and customer-analytics domains. This popularity puzzled me for some time, and I have now reached a point where I know why.
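To see the geometry limitation concretely, here is a minimal sketch using scikit-learn’s make_moons toy dataset; the dataset and all parameter values are my own illustrative choices, not anything from a real project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: a classic example of non-flat geometry.
X, true_labels = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-means looks for compact, roughly spherical clusters, so it splits
# the moons with a straight boundary instead of following their shape.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points where K-means agrees with the true grouping
# (taking the better of the two possible label permutations).
agreement = max(np.mean(pred == true_labels), np.mean(pred != true_labels))
print(f"agreement with true moons: {agreement:.0%}")
```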
#1 High Accuracy Is Often Not a Business Requirement for Clustering
In most of the use cases where clustering is applied, the business requirement is quite straightforward. It is usually something like ‘we need four distinct customer segments’, ‘we need to divide the delivery zones’, or ‘we need to group the images’. In these cases, the business needs only guidance in the form of grouped data so that it can steer its processes accordingly. K-means is good enough to segment the data into decent, useful clusters as long as the input data doesn’t have too many missing values.
When high accuracy is not critical, feasibility of implementation becomes the next key factor in selecting an ML model.
#2 K-Means Is Simple to Implement
K-means is easy to understand and implement because it is based on two simple concepts: centroid movement and cluster assignment. But of course, you don’t have to code it from scratch. The general practice is to simply use the K-means implementation from scikit-learn, TensorFlow, or any other framework where the core functionality is already available. Also, the input to the algorithm is just the data and, in general, two configuration values: the number of clusters and the number of iterations.
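For instance, here is a minimal sketch with scikit-learn’s KMeans; the toy data and the choice of four clusters are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: 200 points in 2D, drawn around four centers
# (a stand-in for real customer features).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (5, 0), (0, 5), (5, 5)]
])

# The two key configurations: number of clusters and the iteration cap.
kmeans = KMeans(n_clusters=4, max_iter=300, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster assignment for each point
print(kmeans.cluster_centers_)   # final centroid positions
```

That is the whole pipeline: under the hood, the algorithm simply alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.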
#3 It Has Less Overhead
K-means isn’t memory-intensive. This means the infrastructure needed for execution will depend more on…
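As a rough back-of-the-envelope sketch (the dataset dimensions below are assumed purely for illustration), the algorithm’s own state is just the centroids and a label per point, so memory is dominated by the data itself:

```python
# Rough memory footprint of K-means state (illustrative numbers).
n, d, k = 1_000_000, 10, 4       # points, features, clusters (assumed)

data_bytes     = n * d * 8       # float64 feature matrix   ~80 MB
centroid_bytes = k * d * 8       # float64 centroids        ~320 bytes
label_bytes    = n * 4           # int32 cluster labels     ~4 MB

print(f"data:      {data_bytes / 1e6:.1f} MB")
print(f"centroids: {centroid_bytes} bytes")
print(f"labels:    {label_bytes / 1e6:.1f} MB")
```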