Clustering algorithms in the service of Data Mining

This article is an attempt to systematize and provide a holistic overview of recent advances in the development of effective approaches to clustering data. It is not intended to describe all clustering algorithms in detail. Rather, the overview character of the article and the topics addressed will help to navigate the wide variety of clustering algorithms.


Clustering - grouping similar objects together - is one of the fundamental tasks in data analysis and data mining. The list of application areas is long: image segmentation, marketing, fraud detection, prediction, text analysis, and many others. Clustering is often the first step in data analysis today. After identifying similar groups, other methods are applied and a separate model is created for each group.

The task of clustering has been formulated in one form or another in scientific fields such as statistics, pattern recognition, optimization, and machine learning. As a result, there are a variety of synonyms for the term "cluster" - class, taxon, agglomeration.

To date, the number of methods for partitioning groups of objects into clusters is quite large - several dozen algorithms and even more their modifications. However, we are interested in clustering algorithms in terms of their application in data mining.

Clustering in Data Mining

Clustering in data mining gains value when it serves as one of the stages of data analysis to build a complete analytical solution. It is often easier for an analyst to select groups of similar objects, examine their characteristics, and build a separate model for each group than to develop a general model for all data. This technique is used in marketing all the time to identify groups of customers, buyers and products and to develop a separate strategy for each of them.

The data processed by data mining technology very often has the following important characteristics:

High dimensionality (thousands of fields) and high volume (hundreds of thousands and millions of records) of database tables and data warehouses;

Data sets contain a large number of numeric and categorical attributes.

All attributes or properties of objects are divided into numeric and categorical. Numeric attributes are those that can be ordered in space, or categorical attributes that cannot be ordered. For example, the attribute "age" is numeric, while "color" is categorical. Assigning values to attributes is done during measurements by the selected scale type and is generally a separate task.

In most clustering algorithms, objects are compared based on some measure of proximity (similarity). A distance measure is a value that has a threshold and increases as the distance between objects increases. Proximity measures are "invented" according to certain rules, and the choice of specific measures depends on the task and the scale of the measurement. Euclidean distance, which is calculated by the formula, is very often used as an approximate measure for numerical attributes:

D(x,y)=i(xy)2D(x, y)=\sqrt{\sum_{i}(x-y)^2}

For category attributes, the Sørensen-Czekanowski coefficient and the Jaccard coefficient are common:

(t1t2/t1t2)(\left|t_{1}\cap t_{2} \right|/\left|t_{1}\cup t_{2} \right|)

The need to process large data sets in data mining has led to the formulation of requirements that the clustering algorithm should meet, if possible. Consider:

  1. Smallest possible number of database iterations;

  2. Operation in a limited amount of computer memory;

  3. The operation of the algorithm can be interrupted, saving the intermediate results to continue the calculations later;

  4. The algorithm should work when objects can be retrieved from the database only in unidirectional cursor mode (i.e., record navigation mode).

An algorithm that satisfies these requirements (especially the second one) is called scalable. Scalability is the most important property of an algorithm, which depends on its computational complexity and software implementation. However, there is also a broader definition. An algorithm is said to be scalable if its uptime increases linearly with the increase in the number of data records in the database, while its storage capacity remains the same.

However, it is not always necessary to process very large data fields. For this reason, in the early days of cluster analysis theory, issues of algorithm scalability were almost neglected. It was assumed that all the processed data would fit in the main memory, the main focus was always on improving the quality of clustering.

It is difficult to find a balance between high clustering quality and scalability. Therefore, data mining should ideally have both efficient algorithms for clustering microarrays and scalable algorithms for processing large databases.

Clustering algorithms: Splendor and miseries

So, it is already possible to divide clustering algorithms into scalable and non-scalable. Let us proceed with the classification.

Depending on the method of clustering, there are two types of algorithms: hierarchical and non-hierarchical. Classical hierarchical algorithms work only for category attributes when a complete tree of nested clusters is constructed. Here, agglomerative methods for building cluster hierarchies are common - they perform a sequential association of initial objects and a corresponding reduction in the number of clusters. Hierarchical algorithms provide relatively high quality clustering and do not require prior specification of the number of clusters. Most of them have a complexity of O(n2)O(n^{2}).

Non-hierarchical algorithms are based on the optimization of an objective function that determines the optimal partitioning of a set of objects into clusters in a particular sense. In this group, algorithms of the k-means family (k-means, fuzzy c-means, Gustafson-Kessel) are common, which use as objective function the sum of squares of the weighted deviations of the object coordinates from the centers of the sought clusters.

The clusters are either spherical or ellipsoidal. In the canonical implementation, the function minimization is based on the method of Lagrange multipliers and allows to find only the nearest local minimum. The use of global search methods (genetic algorithms) significantly increases the computational complexity of the algorithm.


Among the non-hierarchical, non-distance-based algorithms, the EM (Expectation-Maximization) algorithm is noteworthy. Instead of cluster centers, a probability density function with corresponding expected value and variance is assumed for each cluster. The distribution mixture (figure below) is analyzed for its parameters (means and standard deviations) according to the maximum likelihood principle. The EM algorithm is an implementation of such a search. The problem is that before starting the algorithm, a hypothesis is made about the nature of the distributions, which are difficult to estimate in the whole data set.


Another problem occurs when the attributes of an object are mixed - one part is of numeric type and the other part is of category type. For example, let's calculate the distance between the following objects with attributes (age, gender, education level):

{23, male, college education}, (1)

{25, female, high school diploma}, (2)

The first attribute is numeric, the others are categorical. If we want to use a classical hierarchical algorithm with any similarity measure, we need to discredit the "age" attribute in some way. For example, like this:

{under 30, male, college education}, (1)

{under 30, female, high-school diploma}, (2)

In the process, we will certainly lose some information. If we define the distance in the Euclidean space, there are problems with the category attributes. It is clear that the distance between "gender male" and "gender female" is 0, because the values of this attribute are on the naming scale. And the attribute "educational level" can be measured either on the naming scale or on the ordering scale by assigning certain points to each value.

Which option is the right one? What if categorical attributes are more important than numerical attributes? It is up to the analyst to solve these problems. Moreover, when using the k-means algorithm and other similar algorithms, there are difficulties in understanding the cluster centers of the category attributes, in setting the number of clusters as a default.

The optimization algorithm for the objective function in non-hierarchical distance-based algorithms is iterative, and at each iteration the distance matrix between objects must be calculated. For a large number of objects, this is inefficient and requires significant computational resources. The computational complexity of the first iteration of the k-means algorithm is estimated to be O(kmn), where k, m, and n are the number of clusters, attributes, and objects, respectively. But there can be a huge number of iterations! The dataset has to be iterated several times.

The idea of k-means to find spherical or elliptical clusters has many disadvantages. The approach works well when the data form compact clumps in space that are easily distinguishable from each other. However, when the data is nested, none of the k-means family algorithms will do the job. The algorithm also does not work well when one cluster is much larger than the others and they are close to each other - there is a "splitting" effect of a large cluster (Fig. below).


However, research to improve clustering algorithms is ongoing. The K-Means algorithm has developed interesting extensions to handle category attributes (K-Modes) and mixed attributes (K-Prototypes). For example, K-Prototypes computes distances between objects differently depending on the attribute type.

In the area of scalable clustering algorithms, the goal is to reduce any "extra" pass through the dataset during the execution of the algorithm. Scalable K-Means and scalable EM analogs have been developed, as well as scalable agglomerative methods (CURE, CACTUS). These state-of-the-art algorithms require only a few (two to ten) database queries before obtaining the final clustering.

The derivation of scalable algorithms is based on the idea of omitting the local optimization function. The pairwise comparison of objects in the K-Means algorithm is nothing more than local optimization, since the distance from the cluster center to each object must be calculated at each iteration. This leads to a high computational cost.

With the global optimization function, adding a new point to the cluster is not computationally intensive: it is computed based on the old value, the new object, and the so-called cluster features. The specific cluster features depend on the particular algorithm. This is how BIRCH, LargeItem, CLOPE and many other algorithms were created.

Thus, there is no single universal clustering algorithm. When using an algorithm, it is important to know its strengths and weaknesses, and to consider the type of data with which it works best, as well as its scalability.