Unsupervised Learning Project: AllLife Bank Customer Segmentation¶

Welcome to the project on Unsupervised Learning. We will be using Credit Card Customer Data for this project.


Context¶


AllLife Bank wants to focus on its credit card customer base in the next financial year. They have been advised by their marketing research team, that the penetration in the market can be improved. Based on this input, the marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers.

Another insight from the market research was that the customers perceive the support services of the bank poorly. Based on this, the operations team wants to upgrade the service delivery model, to ensure that customers' queries are resolved faster. The head of marketing and the head of delivery, both decide to reach out to the Data Science team for help.


Objective¶


Identify different segments in the existing customer base, taking into account their spending patterns as well as past interactions with the bank.


About the data¶


Data is available on customers of the bank with their credit limit, the total number of credit cards the customer has, and different channels through which the customer has contacted the bank for any queries. These different channels include visiting the bank, online, and through a call center.

  • Sl_no - Customer Serial Number
  • Customer Key - Customer identification
  • Avg_Credit_Limit - Average credit limit (currency is not specified, you can make an assumption around this)
  • Total_Credit_Cards - Total number of credit cards
  • Total_visits_bank - Total bank visits
  • Total_visits_online - Total online visits
  • Total_calls_made - Total calls made

Importing libraries and overview of the dataset¶

Note: Please make sure you have installed the sklearn_extra library before running the below cell. If you have not installed the library, please run the below code to install the library:

!pip install scikit-learn-extra

In [6]:
!pip install scikit-learn-extra
Requirement already satisfied: scikit-learn-extra in /usr/local/lib/python3.11/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.26.4)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.15.3)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.7.0)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.6.0)
In [7]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import warnings
warnings.filterwarnings("ignore")

Loading the data¶

In [8]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [9]:
data = pd.read_excel('/content/drive/MyDrive/MIT - Data Analytics/Elective Project/AllLife Bank/Credit Card Customer Data.xlsx')

Data Overview¶

  • Observations
  • Sanity checks
In [10]:
data.head()
Out[10]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
In [11]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB

Observations:

  • There are 660 entries and 7 columns in this data set
  • All the columns are of integer data type
  • None of the columns have any missing value
In [12]:
data.nunique()
Out[12]:
0
Sl_No 660
Customer Key 655
Avg_Credit_Limit 110
Total_Credit_Cards 10
Total_visits_bank 6
Total_visits_online 16
Total_calls_made 11

  • Customer Key seems to have duplicate values, this needs to be treated before anything is done.
In [13]:
duplicate_keys = data['Customer Key'].duplicated(keep=False)

data[duplicate_keys]
Out[13]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
4 5 47437 100000 6 0 12 3
48 49 37252 6000 4 0 2 8
104 105 97935 17000 2 1 2 10
332 333 47437 17000 7 3 1 0
391 392 96929 13000 4 5 0 0
398 399 96929 67000 6 2 2 2
411 412 50706 44000 4 5 0 2
432 433 37252 59000 6 2 1 2
541 542 50706 60000 7 5 2 2
632 633 97935 187000 7 1 7 0
In [14]:
data = data.drop_duplicates(subset='Customer Key', keep=False)
In [15]:
data.drop(columns = ['Sl_No', 'Customer Key'], inplace = True)
In [16]:
data[data.duplicated()]
Out[16]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
162 8000 2 0 3 4
175 6000 1 0 2 5
215 8000 4 0 4 7
295 10000 6 4 2 3
324 9000 4 5 0 4
361 18000 6 3 1 4
378 12000 6 5 2 1
385 8000 7 4 2 0
395 5000 4 5 0 1
455 47000 6 2 0 4
497 52000 4 2 1 2
In [17]:
data = data[~data.duplicated()]
data.shape
Out[17]:
(639, 5)

Data Preprocessing and Exploratory Data Analysis¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.
  • Check and drop the duplicate customer keys
  • Drop the variables that are not required for the analysis
  • Check duplicate rows and remove them.

Questions:

  1. How does the distribution and outliers look for each variable in the data?
  2. How are the variables correlated with each other?

Check the summary Statistics¶

In [18]:
data.describe().T
Out[18]:
count mean std min 25% 50% 75% max
Avg_Credit_Limit 639.0 34532.081377 37450.554493 3000.0 11000.0 18000.0 48000.0 200000.0
Total_Credit_Cards 639.0 4.699531 2.180100 1.0 3.0 5.0 6.0 10.0
Total_visits_bank 639.0 2.397496 1.620324 0.0 1.0 2.0 4.0 5.0
Total_visits_online 639.0 2.619718 2.942125 0.0 1.0 2.0 4.0 15.0
Total_calls_made 639.0 3.600939 2.870573 0.0 1.0 3.0 5.0 10.0

Observations:___________

  • The average credit limit is 34,532 with a range from 3,000 to 200,000

  • The average credit limit mean is lower than the 75th percentile (48,000)

  • The standard deviation of average credit limit is very high of 37,450 which indicates significant variation in credit limits among customers.

  • The total credit cards between customers seems to range from 1 to 10.

  • The average credit cards between customers is 4.6 which means 5.

  • The average visits to the bank is 2.4

  • the Minimum visits is 0 and maximum visits is 5 which indicates a moderate reliance on in person banking

  • The average visits online to the bank is 2.6

  • the minimum visits online is 3 and maximum is 15 which indicates more people prefer visiting online rather than in person.

  • Total call frequency varies from 0 to 10 and average is around 3.6

Scaling the data¶

In [19]:
for col in data.columns:
    print(col)
    print('Skew :', round(data[col].skew(), 2))

    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)

    data[col].hist()

    plt.ylabel('count')

    plt.subplot(1, 2, 2)

    sns.boxplot(x = data[col])

    plt.show()
Avg_Credit_Limit
Skew : 2.2
No description has been provided for this image
Total_Credit_Cards
Skew : 0.16
No description has been provided for this image
Total_visits_bank
Skew : 0.15
No description has been provided for this image
Total_visits_online
Skew : 2.22
No description has been provided for this image
Total_calls_made
Skew : 0.65
No description has been provided for this image

Average credit limit

  • Most people seem to have credit limit between 0 to 25,000 with very few people having more than 25,000 limit.
  • It is completly left skewed.

Total Credit cards

  • Most people have 4 credit cards and on average they seem to have 5.
  • Boxplot has no outliers which measn it is a good distribution

Total Visits bank

  • There is a peak at 2 visits with most people going twice.
  • It seems like a normal distribution with it being fairly balanced.
  • There are no outliers.

Total visits online

  • Most number of visits online is 0-1.
  • The data is extremely left skewed which means most people visits from 0-6 times and rarely more than that.
  • The box plot seems to have wuiet a few outliers which means rarely people visit more than 8 times, but yet some do.

Total calls made

  • the distribution is fairly uniform with the peak at 4-5 calls
  • The average number of calls seems to be 3
In [20]:
plt.figure(figsize = (8, 8))

sns.heatmap(data.corr(), annot = True, fmt = '0.2f')

plt.show()
No description has been provided for this image
  • The average credit limit is negatively correlated with total calls made and total visits bank.
  • The average credit limit is positively correlated with total credit cards and total visitis online.

Applying PCA on scaled data¶

In [21]:
scaler = StandardScaler()
data_scaled = StandardScaler().fit_transform(data)
In [22]:
from sklearn.decomposition import PCA

n = data.shape[1]

pca = PCA(n_components=n)

principal_components = pca.fit_transform(data_scaled)

data_pca = pd.DataFrame(principal_components, columns = data.columns)
In [23]:
data_copy = data_pca.copy(deep = True)

K-Means¶

Let us now fit the K-means algorithm on our pca components and find out the optimum number of clusters to use.

We will do this in 3 steps:

  1. Initialize a dictionary to store the Sum of Squared Error (SSE) for each K
  2. Run for a range of Ks and store SSE for each run
  3. Plot the SSE vs K and plot the elbow curve
In [24]:
#1
sse = {}

# 2
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, max_iter = 1000, random_state = 1).fit(data_pca)
    sse[k] = kmeans.inertia_

# 3
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of cluster")
plt.ylabel("SSE")

plt.show()
No description has been provided for this image
  • Interpret the above elbow plot and state the reason for choosing the particular value of K
  • Fit the K-means algorithms on the pca components with the number of clusters for the chosen value of K

The chosen value of K is 3. The reason is as this is the elbow point where the curve bends and the reduction becomes smaller. This avoids overfitting the data while still capturing the structure of the data.

In [25]:
kmeans = KMeans(n_clusters=3, max_iter=1000, random_state=1)

kmeans.fit(data_pca)

data_copy['Labels'] = kmeans.labels_
data['Labels'] = kmeans.labels_

Create the cluster profiles using the summary statistics and box plots for each label¶

In [26]:
data.Labels.value_counts()
Out[26]:
count
Labels
0 372
2 219
1 48

In [27]:
mean = data.groupby('Labels').mean()
median = data.groupby('Labels').median()

df_kmeans = pd.concat([mean, median], axis = 0)
df_kmeans.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_kmeans.T
Out[27]:
group_0 Mean group_1 Mean group_2 Mean group_0 Median group_1 Median group_2 Median
Avg_Credit_Limit 33922.043011 140937.500000 12246.575342 31500.0 145500.0 12000.0
Total_Credit_Cards 5.516129 8.833333 2.406393 6.0 9.0 2.0
Total_visits_bank 3.481183 0.604167 0.949772 3.0 1.0 1.0
Total_visits_online 0.981183 10.958333 3.575342 1.0 11.0 4.0
Total_calls_made 2.002688 1.062500 6.872146 2.0 1.0 7.0
In [28]:
data_copy.boxplot(by = 'Labels', layout = (1, 5), figsize = (20, 7))

plt.show()
No description has been provided for this image

Cluster Profiles:_______________

Cluster 0

  • Average Credit Limit: Moderate. mean at 33,920
  • Total Credit Cards: Moderate, mean at 5.5
  • Total Visits Bank: High, mean at 3.4
  • Total Visits Online: Low, mean at 0.98
  • Total Calls Made: Moderate, mean at 2

This group of customers are credit heavy users who prefer in person banking over online interactions.

Cluster 1

  • Average Credit Limit: High, mean at 140,937
  • Total Credit Cards: High, mean at 8.8
  • Total Visits Bank: Low, mean at 0.6
  • Total Visits Online: High, mean at 10.9
  • Total Calls Made: Low, mean at 1.06

This group of customers are probably affluential with very high credit limits.They have low interactions with the banks.

Cluster 3

  • Average Credit Limit: Low, mean at 12,246
  • Total Credit Cards: Low, mean at 2.4
  • Total Visits Bank: Moderate, mean at 0.9
  • Total Visits Online: Moderate, mean at 3.6
  • Total Calls Made: High, mean at 6.9

This group of customers are balanced users. They have low credit limits and low total credit cards. However, they do visit the banks in person and online a lot.

Gaussian Mixture Model¶

Let's now create clusters using the Gaussian Mixture Model.

  • Apply the Gaussian Mixture Model algorithm on the pca components
In [29]:
gmm = GaussianMixture(n_components=3, random_state=1)

gmm.fit(data_pca)

data_copy['GmmLabels'] = gmm.predict(data_pca)
data['GmmLabels'] = gmm.predict(data_pca)
In [30]:
data.GmmLabels.value_counts()
Out[30]:
count
GmmLabels
0 372
2 219
1 48

Create the cluster profiles using the summary statistics and box plots for each label¶

In [31]:
original_features = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made"]

mean = data.groupby('GmmLabels').mean()
median = data.groupby('GmmLabels').median()

df_gmm = pd.concat([mean, median], axis = 0)
df_gmm.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']
df_gmm[original_features].T
Out[31]:
group_0 Mean group_1 Mean group_2 Mean group_0 Median group_1 Median group_2 Median
Avg_Credit_Limit 33922.043011 140937.500000 12246.575342 31500.0 145500.0 12000.0
Total_Credit_Cards 5.516129 8.833333 2.406393 6.0 9.0 2.0
Total_visits_bank 3.481183 0.604167 0.949772 3.0 1.0 1.0
Total_visits_online 0.981183 10.958333 3.575342 1.0 11.0 4.0
Total_calls_made 2.002688 1.062500 6.872146 2.0 1.0 7.0
In [32]:
features_with_lables = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made", "GmmLabels"]

data_copy[features_with_lables].boxplot(by = 'GmmLabels', layout = (1, 5),figsize = (20, 7))

plt.show()
No description has been provided for this image

Compare the clusters from both algorithms - K-means and Gaussian Mixture Model¶

In [33]:
(data['Labels'] == data['GmmLabels']).value_counts()
Out[33]:
count
True 639

Comparing Clusters:____________

As visible from both the clusters K-Means and GMM produced identical cluster assignments for every customer in this dataset. Which means there is no need to distinguish between the two models for interpretations.

K-Medoids¶

  • Apply the K-Medoids clustering algorithm on the pca components
In [34]:
!pip install scikit-learn-extra
Requirement already satisfied: scikit-learn-extra in /usr/local/lib/python3.11/dist-packages (0.3.0)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.26.4)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.15.3)
Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn-extra) (1.7.0)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=0.23.0->scikit-learn-extra) (3.6.0)
In [38]:
from sklearn_extra.cluster import KMedoids

kmedo = KMedoids(n_clusters=3, random_state=1)
kmedo.fit(data_pca)

data_copy['kmedoLabels'] = kmedo.predict(data_pca)

data['kmedoLabels'] = kmedo.predict(data_pca)
In [39]:
data.kmedoLabels.value_counts()
Out[39]:
count
kmedoLabels
2 287
0 220
1 132

Create cluster profiles using the summary statistics and box plots for each label¶

In [40]:
mean = data.groupby('kmedoLabels').mean()

median = data.groupby('kmedoLabels').median()

df_kmedoids = pd.concat([mean, median], axis = 0)

df_kmedoids.index = ['group_0 Mean', 'group_1 Mean', 'group_2 Mean', 'group_0 Median', 'group_1 Median', 'group_2 Median']

df_kmedoids[original_features].T
Out[40]:
group_0 Mean group_1 Mean group_2 Mean group_0 Median group_1 Median group_2 Median
Avg_Credit_Limit 12222.727273 84939.393939 28449.477352 12000.0 68000.0 20000.0
Total_Credit_Cards 2.418182 7.037879 5.372822 2.0 7.0 5.0
Total_visits_bank 0.954545 1.704545 3.822300 1.0 2.0 4.0
Total_visits_online 3.568182 4.583333 0.989547 4.0 2.0 1.0
Total_calls_made 6.859091 1.962121 1.857143 7.0 2.0 2.0
In [37]:
features_with_lables = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made", "kmedoLabels"]

data_copy[features_with_lables].boxplot(by = 'kmedoLabels', layout = (1, 5), figsize = (20, 7))

plt.show()
No description has been provided for this image

Cluster Profiles:____________

Cluster 0

  • Average Credit Limit: Low. mean at 12,222
  • Total Credit Cards: Low mean at 2.4
  • Total Visits Bank: Low, mean at 0.95
  • Total Visits Online: Moderate, mean at 3.6
  • Total Calls Made: High, mean at 6.9

This group of customers appears to be high-touch, low-credit customers. They engage heavily with the bank, both online and offline, likely needing more service or support.

Cluster 1

  • Average Credit Limit: High, mean at 84,939
  • Total Credit Cards: High, mean at 7
  • Total Visits Bank: Moderate, mean at 1.7
  • Total Visits Online: High, mean at 4.6
  • Total Calls Made: Moderate, mean at 1.96

This group of customers are high value customers, they requre a lot of maintanence as well.

Cluster 3

  • Average Credit Limit: Moderate, mean at 28,449
  • Total Credit Cards: Moderate, mean at 5.4
  • Total Visits Bank: High, mean at 3.8
  • Total Visits Online: Low, mean at 1.0
  • Total Calls Made: Low, mean at 1.86

These are balanced users. They have moderate credit and engagement. They may represent the average or transitional segment between the other two clusters.

Compare the clusters from K-Means and K-Medoids¶

In [43]:
comparison = pd.concat([df_kmedoids, df_kmeans], axis = 1)[original_features]

comparison
Out[43]:
Avg_Credit_Limit Avg_Credit_Limit Total_Credit_Cards Total_Credit_Cards Total_visits_bank Total_visits_bank Total_visits_online Total_visits_online Total_calls_made Total_calls_made
group_0 Mean 12222.727273 33922.043011 2.418182 5.516129 0.954545 3.481183 3.568182 0.981183 6.859091 2.002688
group_1 Mean 84939.393939 140937.500000 7.037879 8.833333 1.704545 0.604167 4.583333 10.958333 1.962121 1.062500
group_2 Mean 28449.477352 12246.575342 5.372822 2.406393 3.822300 0.949772 0.989547 3.575342 1.857143 6.872146
group_0 Median 12000.000000 31500.000000 2.000000 6.000000 1.000000 3.000000 4.000000 1.000000 7.000000 2.000000
group_1 Median 68000.000000 145500.000000 7.000000 9.000000 2.000000 1.000000 2.000000 11.000000 2.000000 1.000000
group_2 Median 20000.000000 12000.000000 5.000000 2.000000 4.000000 1.000000 1.000000 4.000000 2.000000 7.000000

Comparing Clusters:___________________

Cluster 0:

  • K Medoids captures this group as low credit, high service need whereas K Means captures this group as mid credit, branch first users.

Cluster 1:

  • Both K Medoids and K Meands captures this group as high credit and low touch customers.

Cluster 2:

  • K Medoids sees this as branch heavy, mid tier customers with high support usage.
  • K Means frames this group as low credit, high service across digital.

Conclusions and Business Recommendations¶

  • It is important for the business to penetrate the credit card segent further. They need to launch personalized marketing campaigns based on the clusters defined.
  • A lot of customers seem to be visiting the branch multiple times or online so therefore it is important to increase customer support experience.
  • Cluster 0 are the low average credit limit and high engagement customers. They are probably younger people. For them the bank can promote self service tools, provide them with ways to boost their credit limits.
  • Cluster 1 are the affluent speakers. They have very high credit limit and low engagement. This means they are wealthy class. For them the bank can create VIP loyalty program, offer more cashback, offer lifestyle perks.
  • Cluster 2 are the mid range credit limit customers with very low digital engagement. For them the bank can create online registration programs, provide additional benefit if done online, have more webinars in place to help them with digital adoption.