Churn modelling

Churn modelling: How to predict the churn?

This technical post is heavily derived from the week 03 of the ML Zoomcamp course offered by Alexey Grigorev and dataTalks.Club.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

Load the data

df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head().transpose()
0 1 2 3 4
customerID 7590-VHVEG 5575-GNVDE 3668-QPYBK 7795-CFOCW 9237-HQITU
gender Female Male Male Male Female
SeniorCitizen 0 0 0 0 0
Partner Yes No No No No
Dependents No No No No No
tenure 1 34 2 45 2
PhoneService No Yes Yes No Yes
MultipleLines No phone service No No No phone service No
InternetService DSL DSL DSL DSL Fiber optic
OnlineSecurity No Yes Yes Yes No
OnlineBackup Yes No Yes No No
DeviceProtection No Yes No Yes No
TechSupport No No No Yes No
StreamingTV No No No No No
StreamingMovies No No No No No
Contract Month-to-month One year Month-to-month One year Month-to-month
PaperlessBilling Yes No Yes No Yes
PaymentMethod Electronic check Mailed check Mailed check Bank transfer (automatic) Electronic check
MonthlyCharges 29.85 56.95 53.85 42.3 70.7
TotalCharges 29.85 1889.5 108.15 1840.75 151.65
Churn No No Yes No Yes
df.columns = df.columns.str.lower()\
                       .str.replace(" ", "_")
df.columns
Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')
df.isnull().sum().sort_values(ascending=False)
totalcharges        11
customerid           0
deviceprotection     0
monthlycharges       0
paymentmethod        0
paperlessbilling     0
contract             0
streamingmovies      0
streamingtv          0
techsupport          0
onlinebackup         0
gender               0
onlinesecurity       0
internetservice      0
multiplelines        0
phoneservice         0
tenure               0
dependents           0
partner              0
seniorcitizen        0
churn                0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerid        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   seniorcitizen     7043 non-null   int64  
 3   partner           7043 non-null   object 
 4   dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   phoneservice      7043 non-null   object 
 7   multiplelines     7043 non-null   object 
 8   internetservice   7043 non-null   object 
 9   onlinesecurity    7043 non-null   object 
 10  onlinebackup      7043 non-null   object 
 11  deviceprotection  7043 non-null   object 
 12  techsupport       7043 non-null   object 
 13  streamingtv       7043 non-null   object 
 14  streamingmovies   7043 non-null   object 
 15  contract          7043 non-null   object 
 16  paperlessbilling  7043 non-null   object 
 17  paymentmethod     7043 non-null   object 
 18  monthlycharges    7043 non-null   float64
 19  totalcharges      7043 non-null   object 
 20  churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
df["totalcharges"] = pd.to_numeric(df['totalcharges'], errors="coerce")
df["totalcharges"].isnull().sum()
11
df["totalcharges"] = df["totalcharges"].fillna(value=0)

New

df["churn"] = (df["churn"] == 'Yes').astype(int)
df.head()
customerid gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity ... deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 0
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.50 0
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 1
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 1

5 rows × 21 columns

df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=1)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train["churn"].values
y_val = df_val["churn"].values
y_test = df_test["churn"].values
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

Exploratory Data Analysis

To understand the data in detail we will perform an exploratory data analysis.
Let's get started.

df_train_full.isnull().mean().sort_values(ascending=False)
customerid          0.0
deviceprotection    0.0
totalcharges        0.0
monthlycharges      0.0
paymentmethod       0.0
paperlessbilling    0.0
contract            0.0
streamingmovies     0.0
streamingtv         0.0
techsupport         0.0
onlinebackup        0.0
gender              0.0
onlinesecurity      0.0
internetservice     0.0
multiplelines       0.0
phoneservice        0.0
tenure              0.0
dependents          0.0
partner             0.0
seniorcitizen       0.0
churn               0.0
dtype: float64
df_train_full["churn"].value_counts(normalize=True)
0    0.730032
1    0.269968
Name: churn, dtype: float64

We can see that the churn rate is ~27%

df_train_full.dtypes
customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object
numerical = ['tenure', 'monthlycharges', 'totalcharges']
df_train_full.columns
Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
        'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']
df_train_full[categorical].nunique()
gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

Feature Importance

churn_female = df_train_full[df_train_full["gender"] == 'Female']["churn"].mean()
churn_male = df_train_full[df_train_full["gender"] == 'Male']["churn"].mean()

for col in categorical:
    df_risk_group = df_train_full.groupby(col)['churn'].agg(['mean', 'count'])

    df_risk_group['diff'] = df_risk_group['mean'] - 0.2699
    df_risk_group['risk'] = df_risk_group['mean']/ 0.2699
    display(df_risk_group)
mean count diff risk
gender
Female 0.276824 2796 0.006924 1.025654
Male 0.263214 2838 -0.006686 0.975226
mean count diff risk
seniorcitizen
0 0.242270 4722 -0.027630 0.897630
1 0.413377 912 0.143477 1.531594
mean count diff risk
partner
No 0.329809 2932 0.059909 1.221967
Yes 0.205033 2702 -0.064867 0.759664
mean count diff risk
dependents
No 0.313760 3968 0.043860 1.162505
Yes 0.165666 1666 -0.104234 0.613806
mean count diff risk
phoneservice
No 0.241316 547 -0.028584 0.894095
Yes 0.273049 5087 0.003149 1.011667
mean count diff risk
multiplelines
No 0.257407 2700 -0.012493 0.953714
No phone service 0.241316 547 -0.028584 0.894095
Yes 0.290742 2387 0.020842 1.077219
mean count diff risk
internetservice
DSL 0.192347 1934 -0.077553 0.712662
Fiber optic 0.425171 2479 0.155271 1.575292
No 0.077805 1221 -0.192095 0.288274
mean count diff risk
onlinesecurity
No 0.420921 2801 0.151021 1.559545
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.153226 1612 -0.116674 0.567713
mean count diff risk
onlinebackup
No 0.404323 2498 0.134423 1.498049
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.217232 1915 -0.052668 0.804862
mean count diff risk
deviceprotection
No 0.395875 2473 0.125975 1.466749
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.230412 1940 -0.039488 0.853695
mean count diff risk
techsupport
No 0.418914 2781 0.149014 1.552108
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.159926 1632 -0.109974 0.592540
mean count diff risk
streamingtv
No 0.342832 2246 0.072932 1.270217
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.302723 2167 0.032823 1.121610
mean count diff risk
streamingmovies
No 0.338906 2213 0.069006 1.255674
No internet service 0.077805 1221 -0.192095 0.288274
Yes 0.307273 2200 0.037373 1.138469
mean count diff risk
contract
Month-to-month 0.431701 3104 0.161801 1.599485
One year 0.120573 1186 -0.149327 0.446733
Two year 0.028274 1344 -0.241626 0.104757
mean count diff risk
paperlessbilling
No 0.172071 2313 -0.097829 0.637536
Yes 0.338151 3321 0.068251 1.252876
mean count diff risk
paymentmethod
Bank transfer (automatic) 0.168171 1219 -0.101729 0.623085
Credit card (automatic) 0.164339 1217 -0.105561 0.608887
Electronic check 0.455890 1893 0.185990 1.689108
Mailed check 0.193870 1305 -0.076030 0.718302

Mutual Information

From the sklearn documentation

Mutual Information is a measure of the similarity between two labels of the same data.


from sklearn.metrics import mutual_info_score
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_train_full['churn'])
df_train_full[categorical].apply(mutual_info_churn_score).sort_values(ascending=False)
contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64

Correlation

df_train_full[numerical].corrwith(df_train_full['churn']).to_frame()
0
tenure -0.351885
monthlycharges 0.196805
totalcharges -0.196353

Tenure vs. Churn rate

Let's look at relation between the customers tenure and churn rate.

df_train_full[df_train_full['tenure'] <=2]['churn'].mean()
0.5953420669577875
df_train_full[(df_train_full['tenure'] > 2) & (df_train_full['tenure'] <= 12)]['churn'].mean()
0.3994413407821229
df_train_full[df_train_full['tenure'] > 12]['churn'].mean()
0.17634908339788277

Monthly Charges vs Churn rate

Let's take a look at the monthly charges vs Churn rate

df_train_full[df_train_full['monthlycharges'] <= 20]['churn'].mean()
0.08795411089866156
df_train_full[(df_train_full['monthlycharges'] > 20) & (df_train_full['monthlycharges'] <=50)]['churn'].mean()
0.18340943683409436
df_train_full[df_train_full['monthlycharges'] > 50]['churn'].mean()
0.32499341585462205

As shown the correlation is positive

One-hot encoding

DictVectorizer preserves the numerical column information while the categorical variables are one hot encoded. We fit the vectorizer on the training dataset and only transform the validation dataset.

from sklearn.feature_extraction import DictVectorizer
categorical
['gender',
 'seniorcitizen',
 'partner',
 'dependents',
 'phoneservice',
 'multiplelines',
 'internetservice',
 'onlinesecurity',
 'onlinebackup',
 'deviceprotection',
 'techsupport',
 'streamingtv',
 'streamingmovies',
 'contract',
 'paperlessbilling',
 'paymentmethod',
 'churn']
train_dicts = df_train[categorical+numerical].to_dict(orient="records")
dictvect = DictVectorizer(sparse=False)
dictvect.fit(train_dicts)

DictVectorizer(sparse=False)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
<input class="sk-toggleablecontrol sk-hidden--visually" id="sk-estimator-id-1" type="checkbox" checked>DictVectorizer
DictVectorizer(sparse=False)

dictvect.feature_names_
['contract=Month-to-month',
 'contract=One year',
 'contract=Two year',
 'dependents=No',
 'dependents=Yes',
 'deviceprotection=No',
 'deviceprotection=No internet service',
 'deviceprotection=Yes',
 'gender=Female',
 'gender=Male',
 'internetservice=DSL',
 'internetservice=Fiber optic',
 'internetservice=No',
 'monthlycharges',
 'multiplelines=No',
 'multiplelines=No phone service',
 'multiplelines=Yes',
 'onlinebackup=No',
 'onlinebackup=No internet service',
 'onlinebackup=Yes',
 'onlinesecurity=No',
 'onlinesecurity=No internet service',
 'onlinesecurity=Yes',
 'paperlessbilling=No',
 'paperlessbilling=Yes',
 'partner=No',
 'partner=Yes',
 'paymentmethod=Bank transfer (automatic)',
 'paymentmethod=Credit card (automatic)',
 'paymentmethod=Electronic check',
 'paymentmethod=Mailed check',
 'phoneservice=No',
 'phoneservice=Yes',
 'seniorcitizen',
 'streamingmovies=No',
 'streamingmovies=No internet service',
 'streamingmovies=Yes',
 'streamingtv=No',
 'streamingtv=No internet service',
 'streamingtv=Yes',
 'techsupport=No',
 'techsupport=No internet service',
 'techsupport=Yes',
 'tenure',
 'totalcharges']
X_train = dictvect.transform(train_dicts)
X_train
array([[0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        7.20000e+01, 8.42515e+03],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+01, 1.02155e+03],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        5.00000e+00, 4.13650e+02],
       ...,
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        2.00000e+00, 1.90050e+02],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        2.70000e+01, 7.61950e+02],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        9.00000e+00, 7.51650e+02]])
val_dicts = df_val[categorical+numerical].to_dict(orient="records")
X_val = dictvect.transform(val_dicts)
X_val
array([[0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0000e+00, 7.1000e+01,
        4.9734e+03],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        2.0750e+01],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        2.0350e+01],
       ...,
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.0000e+00, 1.8000e+01,
        1.0581e+03],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        9.3300e+01],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 3.0000e+00,
        2.9285e+02]])

Logistic Regression

$$y{i} = f(x{i})$$

Limits to values between 0 and 1, using a special function called sigmoid or logit.

$$ \sigma(z) = \frac{1}{1+e^{-z}} $$

where $\sigma(-\infty) = 0 $

def sigmoid(z):
    return 1/(1+ np.exp(-z))
z  = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.axvline(x=0,color='black', ls='--' )
plt.axhline(y=1.0, color='gray', ls='-.')
plt.axhline(y=0.5, color='gray', ls='-.')
plt.axhline(y=0.0, color='gray', ls='-.')
<matplotlib.lines.Line2D at 0x7f9145b0f760>

png

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
<input class="sk-toggleablecontrol sk-hidden--visually" id="sk-estimator-id-2" type="checkbox" checked>LogisticRegression
LogisticRegression()

model.intercept_
array([-0.10906984])
model.coef_.round(3)
array([[ 0.475, -0.175, -0.407, -0.03 , -0.078,  0.063, -0.089, -0.081,
        -0.034, -0.073, -0.335,  0.316, -0.089,  0.004, -0.258,  0.141,
         0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
         0.123, -0.166,  0.058, -0.087, -0.032,  0.07 , -0.059,  0.141,
        -0.249,  0.215, -0.12 , -0.089,  0.102, -0.071, -0.089,  0.052,
         0.213, -0.089, -0.232, -0.07 ,  0.   ]])

Hard predictions

model.predict(X_train)
array([0, 1, 1, ..., 1, 0, 1])
model.predict_proba(X_train)
array([[0.90443048, 0.09556952],
       [0.32074291, 0.67925709],
       [0.36639176, 0.63360824],
       ...,
       [0.4684025 , 0.5315975 ],
       [0.95750228, 0.04249772],
       [0.30138089, 0.69861911]])
y_pred = model.predict_proba(X_val)
y_pred
array([[0.99100249, 0.00899751],
       [0.79563017, 0.20436983],
       [0.78794861, 0.21205139],
       ...,
       [0.8636199 , 0.1363801 ],
       [0.20029807, 0.79970193],
       [0.16265877, 0.83734123]])

The first column gives us the probability of user not churning and the second column the probability of user churning.

Now, define a threshold of 0.5, to identify the users who could churn out.

churning_users = (y_pred[:, 1] > 0.5)
df_val[churning_users]['customerid']
3       8433-WXGNA
8       3440-JPSCL
11      2637-FKFSY
12      7228-OMTPN
19      6711-FLDFB
           ...    
1397    5976-JCJRH
1398    2034-CGRHZ
1399    5276-KQWHG
1407    6521-YYTYI
1408    3049-SOLAY
Name: customerid, Length: 311, dtype: object
y_val
array([0, 0, 0, ..., 0, 1, 1])
churning_users.astype(int)
array([0, 0, 0, ..., 0, 1, 1])
(y_val == churning_users).mean()
0.8034066713981547
df_predictions = pd.DataFrame()
df_predictions['churn_probability'] = y_pred[:, 1]
df_predictions['predicted'] = churning_users.astype(int)
df_predictions['actual'] = y_val
df_predictions['correct_predictions'] = (df_predictions['predicted'] == df_predictions['actual'])
df_predictions['correct_predictions'].mean()
0.8034066713981547

Model Interpretation

dict(zip(dictvect.feature_names_, model.coef_[0].round(3)))
{'contract=Month-to-month': 0.475,
 'contract=One year': -0.175,
 'contract=Two year': -0.407,
 'dependents=No': -0.03,
 'dependents=Yes': -0.078,
 'deviceprotection=No': 0.063,
 'deviceprotection=No internet service': -0.089,
 'deviceprotection=Yes': -0.081,
 'gender=Female': -0.034,
 'gender=Male': -0.073,
 'internetservice=DSL': -0.335,
 'internetservice=Fiber optic': 0.316,
 'internetservice=No': -0.089,
 'monthlycharges': 0.004,
 'multiplelines=No': -0.258,
 'multiplelines=No phone service': 0.141,
 'multiplelines=Yes': 0.009,
 'onlinebackup=No': 0.063,
 'onlinebackup=No internet service': -0.089,
 'onlinebackup=Yes': -0.081,
 'onlinesecurity=No': 0.266,
 'onlinesecurity=No internet service': -0.089,
 'onlinesecurity=Yes': -0.284,
 'paperlessbilling=No': -0.231,
 'paperlessbilling=Yes': 0.123,
 'partner=No': -0.166,
 'partner=Yes': 0.058,
 'paymentmethod=Bank transfer (automatic)': -0.087,
 'paymentmethod=Credit card (automatic)': -0.032,
 'paymentmethod=Electronic check': 0.07,
 'paymentmethod=Mailed check': -0.059,
 'phoneservice=No': 0.141,
 'phoneservice=Yes': -0.249,
 'seniorcitizen': 0.215,
 'streamingmovies=No': -0.12,
 'streamingmovies=No internet service': -0.089,
 'streamingmovies=Yes': 0.102,
 'streamingtv=No': -0.071,
 'streamingtv=No internet service': -0.089,
 'streamingtv=Yes': 0.052,
 'techsupport=No': 0.213,
 'techsupport=No internet service': -0.089,
 'techsupport=Yes': -0.232,
 'tenure': -0.07,
 'totalcharges': 0.0}

Did you find this article valuable?

Support Sanjiv Chemudupati by becoming a sponsor. Any amount is appreciated!