Customer Churn Prediction:
A Classification Approach

Business Intelligence
Daniel Querales

Building a predictive model to identify customers most likely to churn (cancel their subscription or service) using historical data, enabling proactive retention strategies.

Project Overview: Mitigating Revenue Loss

Churn prediction is a binary classification problem: classifying a customer as 'will churn' (1) or 'will not churn' (0). This project uses a synthetic telecommunications dataset to analyze customer behavior (e.g., contract type, monthly charges, usage) and predict their loyalty status.

  • Dataset: Telco Customer Churn Dataset (simulated features).
  • Goal: Predict the binary outcome ('Churn' column).
  • Models Used: Logistic Regression (Baseline) and Random Forest Classifier.
  • Key Metrics: Precision, Recall, F1-Score, and AUC-ROC.

Step 1: Data Preprocessing and Feature Selection

The dataset includes categorical features (like 'Gender', 'Contract Type') and numerical features ('Tenure', 'Monthly Charges'). Categorical variables must be converted to a numerical format using techniques like One-Hot Encoding.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load dataset (simulation)
df = pd.read_csv('telco_churn_data.csv')

# Drop CustomerID and convert target variable
df = df.drop('customerID', axis=1)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Separate features (X) and target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Identify categorical and numerical features
categorical_cols = X.select_dtypes(include='object').columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# One-hot encode categorical features; drop_first avoids redundant dummy columns.
# (pd.get_dummies is used here for brevity instead of a full scikit-learn Pipeline.)
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Split data; stratify=y keeps the churn rate the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)} samples.")
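
The imports above include StandardScaler and OneHotEncoder, although the snippet itself relies on pd.get_dummies. As a minimal sketch of the scikit-learn alternative, the example below wires both transformers into a ColumnTransformer that is fitted on the training split only. The toy frame, its values, and the column names (Contract, tenure, MonthlyCharges) are assumptions for illustration, not the project's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame standing in for the Telco data (values are made up).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month",
                 "One year", "Month-to-month", "Two year", "Month-to-month"],
    "tenure": [1, 24, 60, 3, 12, 5, 48, 2],
    "MonthlyCharges": [70.0, 55.5, 20.0, 89.9, 45.0, 99.0, 25.5, 75.0],
    "Churn": [1, 0, 0, 1, 0, 1, 0, 1],
})
X_toy = toy.drop(columns="Churn")
y_toy = toy["Churn"]

# Split first so the transformers never see the test rows while fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Scale numeric columns, one-hot encode categorical ones.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,
)

Xt_train = preprocess.fit_transform(X_tr)  # learn means, stds, categories
Xt_test = preprocess.transform(X_te)       # reuse the fitted parameters

print(Xt_train.shape, Xt_test.shape)
```

Fitting the transformer on the training split and only transforming the test split keeps test-set statistics out of the preprocessing step.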

Step 2: Model Training and Selection

We start with Logistic Regression as a linear baseline and then move to Random Forest, an ensemble method that can capture non-linear relationships and feature interactions the linear baseline misses. Neither model corrects for class imbalance by default, so Step 3 evaluates both with imbalance-aware metrics.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Standardize numerical features after splitting: fit the scaler on the
# training set only, then apply it to both sets (avoids test-set leakage)
scaler = StandardScaler().fit(X_train[numerical_cols])
X_train_scaled, X_test_scaled = X_train.copy(), X_test.copy()
X_train_scaled[numerical_cols] = scaler.transform(X_train[numerical_cols])
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])

# 1. Train Logistic Regression (Baseline) on the scaled features
log_model = LogisticRegression(solver='liblinear', random_state=42)
log_model.fit(X_train_scaled, y_train)

# 2. Train Random Forest Classifier (tree-based, so scaling is not required)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

print("Models trained.")

Step 3: Performance Evaluation (Focusing on Imbalance)

Since churn datasets are often imbalanced (fewer customers churn than stay), accuracy alone is misleading: a model that predicts 'no churn' for everyone can still score high. We focus on Recall (the share of actual churners the model finds) and the AUC-ROC score (the model's ability to distinguish between classes).
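
To make the accuracy problem concrete, here is a small sketch with hypothetical labels (100 customers, 10 churners, chosen only for illustration) and a "model" that never predicts churn.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 100 customers, 10 of whom churn (1 = churn).
y_true = [1] * 10 + [0] * 90

# A "model" that never predicts churn.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # 0.90
print(f"Recall:   {recall:.2f}")    # 0.00
```

The predictor is right 90% of the time yet finds none of the churners, which is why recall on the churn class is reported alongside any overall score.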

AUC-ROC Score

Area Under the Receiver Operating Characteristic curve. It summarizes how well the model ranks churners above non-churners across all classification thresholds, rather than at a single cutoff.

0.84 (Random Forest - Simulated)

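Interpretation: An AUC of 0.84 means that a randomly chosen churner receives a higher predicted risk than a randomly chosen non-churner about 84% of the time (0.5 corresponds to random guessing, 1.0 to perfect ranking).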
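
The ranking interpretation can be checked directly. The sketch below uses four made-up scores (not model output) and compares a hand-counted pairwise ranking rate with scikit-learn's roc_auc_score.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted churn scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, y in zip(scores, y_true) if y == 1]  # churners
neg = [s for s, y in zip(scores, y_true) if y == 0]  # non-churners

# Count churner/non-churner pairs where the churner is ranked higher;
# ties count as half a win.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
pairwise_auc = wins / (len(pos) * len(neg))
sklearn_auc = roc_auc_score(y_true, scores)

print(pairwise_auc, sklearn_auc)  # 0.75 0.75
```

Three of the four churner/non-churner pairs are ordered correctly, so both calculations give 0.75.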

Recall (Churn Class)

Of all the customers who actually churned, how many did the model correctly identify?

0.79 (Simulated)

Interpretation: At the chosen decision threshold, the model correctly flags 79% of the customers who actually leave (true positives); the remaining 21% are missed (false negatives).
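
Recall depends on the decision threshold applied to the predicted probabilities. The sketch below uses ten made-up probabilities and a hypothetical helper, recall_precision, to show how lowering the threshold raises recall at the cost of precision.

```python
def recall_precision(y_true, proba, threshold):
    """Return (recall, precision) for the churn class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical data: 4 churners followed by 6 non-churners.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.45, 0.2, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

default_cut = recall_precision(y_true, proba, threshold=0.5)
lower_cut = recall_precision(y_true, proba, threshold=0.3)

print(default_cut)  # recall 0.50, precision about 0.67
print(lower_cut)    # recall 0.75, precision 0.50
```

Where to set the threshold is a business decision: a missed churner and an unnecessary retention offer carry different costs.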

[Image of a ROC curve]

Feature Importance and Business Impact

Beyond prediction, the Random Forest model exposes feature importances, which show the features that contribute most to its churn predictions. These rankings describe what the model relies on rather than proven causes of churn, but they point the business to the factors most worth investigating.
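
As a minimal sketch of how such a ranking can be extracted, the example below fits a Random Forest on synthetic data from make_classification and reads its feature_importances_ attribute. The data and the generic feature names are assumptions for illustration, so the resulting numbers say nothing about real churn drivers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with generic, hypothetical feature names.
feature_names = [f"feature_{i}" for i in range(5)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1, sorted high to low.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor high-cardinality features, so a cross-check with sklearn.inspection.permutation_importance on held-out data is a reasonable next step.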

[Image: bar chart showing feature importance, with Contract type and Monthly charges being the highest]
Figure 1: Feature Importance - Key factors for churn typically include contract type, short tenure, and high monthly charges.
  • Actionable Insight: Customers on month-to-month contracts and those paying high monthly fees without security options show the highest propensity to churn.
  • Mitigation Strategy: The business can target the top 10% 'at risk' customers with tailored retention offers (e.g., long-term contract discounts or personalized service check-ins).
  • Continuous Monitoring: The model must be retrained regularly as customer behavior and market conditions change over time.
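
The top-10% targeting described above can be sketched as a simple ranking of predicted churn probabilities. The customer IDs and random probabilities below are placeholders; in practice the scores would come from the trained model's predict_proba output.

```python
import numpy as np
import pandas as pd

# Placeholder scores: 50 hypothetical customers with random churn probabilities.
rng = np.random.default_rng(42)
scores = pd.DataFrame({
    "customer": [f"C{i:03d}" for i in range(50)],
    "churn_proba": rng.random(50),
})

# Select the 10% of customers with the highest predicted churn risk.
top_fraction = 0.10
n_target = max(1, round(len(scores) * top_fraction))
at_risk = scores.nlargest(n_target, "churn_proba")

print(at_risk)
```

The fraction is a budget choice: widening it reaches more true churners but spends retention offers on more customers who would have stayed anyway.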