Customer segmentation rests on the assumption that customers differ from one another in their behavior and that this behavior exhibits certain regular patterns. Segmentation is an effort to divide the customer base into meaningful groups, according to selected customer attributes, so that customers in the same group show similar behavior in certain respects.

Aims:

  • Deeper customer understanding
  • Targeted marketing
  • Optimal product placement
  • Search for latent customer segments
  • Higher sales

K-means cluster analysis

One of the techniques for customer segmentation is cluster analysis. The term covers a number of different methods and approaches, and which method is most appropriate depends on the specific case. In any case, it is necessary to select the attributes on which the analysis will be based. For customer segmentation, purchasing behavior is a natural choice, but demographic, psychographic or geographic information can also be used, for example.

In this article, we will demonstrate how such customer segmentation can be carried out using K-means analysis. There is now a range of tools for data analysis, from dedicated programs such as Excel to various programming languages (R, Python, Julia, Scala, …) and suitable libraries. We decided on Python. It is a general-purpose programming language that is also widely used for scientific computing and work with data, and thanks to its popularity and wide adoption there are a number of reliable and well-tested libraries available.

The algorithm for k-means cluster analysis is quite simple and efficient. Its basic assumption is that all attributes of the individual data records are numeric. Its essence is that distances between data records are calculated, and records that are “close” to each other form a cluster, i.e. a segment. The user must decide how many groups to split the data into. This decision is somewhat arbitrary, and knowledge of the relevant type of business and its customers plays a role in making it. Even so, the analysis is often carried out for several different numbers of groups, and the optimal number of segments is determined from the results. The algorithm looks like this (a minimal code sketch follows the list):

  • Step 1: The user decides how many groups to split the data into.
  • Step 2: k records are randomly selected as the initial cluster centers – centroids C1, …, Ck.
  • Step 3: The nearest centroid is found for each record. So, each centroid “owns” a subset of records, and therefore we have the data divided into clusters.
  • Step 4: For each cluster, its “center of gravity” is calculated, and these centers of gravity are now the new centers of clusters – centroids.
  • Step 5: Steps 3 and 4 are repeated as long as the newly calculated centers of gravity differ from the current centroids.
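To make the steps above concrete, here is a minimal sketch of the loop in Python with NumPy. It is only an illustration under our own naming choices; in practice one would use a library implementation, as we do later.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch: data is an (n_records, n_attributes) NumPy array."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random records as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every record to its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: the centers of gravity of the clusters become the new centroids
        # (for simplicity we assume no cluster ends up empty)
        new_centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```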

Data

The data we will be using can be obtained from the UCI Machine Learning Repository. The database consists of a list of transactions of an online retail store based in the UK, which specializes in gifts for all possible occasions; a large part of its customers are wholesalers. The specific link is http://archive.ics.uci.edu/ml/datasets/online+retail, where you can find more detailed information about the data.

The database we will use contains, apparently to preserve customer anonymity, only information about the transactions of corporate customers; there are no other attributes in it.

(Figure: a preview of the transaction data.)

We can see that the data contains only 8 attributes – invoice number (InvoiceNo), stock code (StockCode), description (Description), quantity (Quantity), invoice date (InvoiceDate), unit price (UnitPrice), customer number (CustomerID) and country of the customer (Country). This is somewhat atypical; most client databases contain more details about individual clients.
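As an illustration, the data can be loaded with pandas. The file name and the removal of rows without a customer number are our own assumptions, based on how the dataset is distributed by UCI (as an Excel workbook):

```python
import pandas as pd

# The UCI "Online Retail" dataset is distributed as an Excel workbook.
df = pd.read_excel("Online Retail.xlsx")
print(df.columns.tolist())
# ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
#  'UnitPrice', 'CustomerID', 'Country']

# Rows without a CustomerID cannot be attributed to any customer, so we drop them.
df = df.dropna(subset=["CustomerID"])
```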

Given this situation, we will use RFM analysis, which is one of the basic methods for customer segmentation. RFM analysis is based on three essential characteristics of each customer:

  • Recency – i.e. the length of the period since the last purchase
  • Frequency – i.e. frequency of purchases
  • Monetary value – i.e. the total financial value of all transactions made by the customer

Now let’s create the RFM attributes, starting with Recency. To determine the time since the last purchase, we need a reference date. One option is to take the date of the last transaction in our database and add one day. The Recency attribute is then the number of days between a customer’s last purchase and this reference date.

Furthermore, for each customer we find the total number of purchases and calculate the total financial value of their transactions. This gives us the attributes Frequency and Monetary value (which we will denote amount for short). We will then use this new data for the cluster analysis.
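One possible way to derive the three attributes with pandas is sketched below; it continues from the `df` table loaded above, the column names follow the dataset, and the variable names are our own:

```python
import pandas as pd

# Financial value of each transaction line
df["Amount"] = df["Quantity"] * df["UnitPrice"]

# Reference date: the day after the last transaction in the database
reference_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer with the three RFM attributes
rfm = df.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    frequency=("InvoiceNo", "nunique"),  # number of distinct purchases
    amount=("Amount", "sum"),            # Monetary value
)
print(rfm.head())
```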


Evaluating the segmentation

When deciding on the suitability and applicability of the acquired cluster models, we will try to find answers to the following questions:

  • Does the segmentation correspond to reality or is it just a mathematical artifact?
  • What is the optimal number of segments to look for?
  • How do we determine whether one set of segments is more suitable than another?

Any measure of the quality of a segmentation must take into account two criteria: the degree of segment separation and the degree of segment cohesion. In other words, how far apart the individual clusters in the model are, and how tightly the elements within each cluster are bound together.

There are a number of mathematical methods for answering these questions; here we will use a method called the silhouette. For the i-th element of the data, the silhouette value is defined as

s_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the distance of the i-th element from the center of its own cluster and b_i is its distance from the center of the nearest other cluster.

The silhouette coefficient is the average of s_i over all the data. It is obvious that −1 ≤ s_i ≤ 1, and therefore the silhouette coefficient also lies in the interval [−1, 1]. Furthermore, the higher the silhouette coefficient, the better the separation and cohesion of the clusters; conversely, if separation or cohesion is poor, the silhouette coefficient is low or even negative.

Results and analysis

To calculate the silhouette coefficient, we can use the scikit-learn library. In addition, a suitable graphical representation helps to get a better picture. We will show this below for our specific data. To create the following images, we adapted the silhouette-analysis example that the scikit-learn documentation provides for illustration.
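A minimal sketch of the calculation, assuming the `rfm` table built above; standardizing the three attributes before clustering is our own preprocessing choice, not something the silhouette method requires:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Put the three RFM attributes on a comparable scale before clustering.
X = StandardScaler().fit_transform(rfm[["recency", "frequency", "amount"]])

for k in (3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))
```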

(Figures: silhouette plots and two-dimensional projections of the clusters, for 3 segments above and 5 segments below.)

The upper figure shows the analysis performed for 3 clusters, the lower one for 5. The silhouette plots are displayed on the left, where we can see how the silhouette values are distributed across the data in each segment. We can see that the silhouette coefficient is higher for the segmentation into 3 clusters (specifically 0.304) than for 5 clusters (0.29).

On the right side, we see a projection of the division into two-dimensional space. When visualizing a segmentation, it is necessary to keep in mind that each data record is generally a point in a multidimensional space; in our case we have three attributes, so a three-dimensional one. When projecting into two dimensions, the other dimensions are collapsed, and the points from the individual segments may not appear separated. (Various techniques exist for reducing the number of variables (attributes) appropriately, e.g. PCA – Principal Component Analysis; we will discuss these methods in more detail in another blog post.)
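As an illustration of such a projection, one option is to reduce the three standardized RFM dimensions to two principal components and plot them; this sketch reuses `X` from the snippet above and fits a 3-cluster model for coloring:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the three-dimensional RFM data onto the first two principal components.
coords = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.show()
```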

So what do the specific numbers look like, and what are the values of the centers of each cluster?

(Figure: the cluster centers for the 3-segment and 5-segment solutions.)

A look at the results offers some interesting insights. First of all, for 3 segments:

  • All three clusters differ significantly in Monetary value.
  • Segment 2 is a segment of very valuable customers who shop with high frequency.
  • Frequency and Recency reliably correlate with Monetary value.

Segmentation into 5 segments, which according to the silhouette coefficient is only slightly worse than the 3-segment solution, offers an interesting division of the best customers into two subgroups.

Deciding which segmentation is more appropriate will probably depend largely on the specific business practice rather than on statistical criteria alone.

As I hope you have seen, segmentation itself is a fairly simple problem. From the many possible algorithms, we have chosen one that allows the whole issue to be demonstrated. Of course, segmentation is far from answering all questions, but it does provide an additional perspective on customers and their behavior.

Finally, one important remark. The results of any such analysis depend primarily on the data submitted to it. If the data are scarce, inaccurate, incomplete or inhomogeneous, the conclusions of the analysis cannot be expected to be of much value.