We run our sixth ML model, a support vector machine (SVM), first with default parameters, then with tuned hyperparameters to try to improve it. We also visualise accuracy scores, the confusion matrix and the ROC curve, and end by dumping our best model for later comparison.

Support Vector Machine

%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py

Preparing data

modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)

#check
modelling_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 107184 entries, 0 to 108905
Data columns (total 32 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   lat_deg                   107184 non-null  float64
 1   lon_deg                   107184 non-null  float64
 2   is_functioning            107184 non-null  int64  
 3   distance_to_primary       107184 non-null  float64
 4   distance_to_secondary     107184 non-null  float64
 5   distance_to_tertiary      107184 non-null  float64
 6   distance_to_city          107184 non-null  float64
 7   distance_to_town          107184 non-null  float64
 8   usage_cap                 107184 non-null  float64
 9   is_complex_tech           107184 non-null  int64  
 10  is_installed_after_2006   107184 non-null  int64  
 11  is_public_management      107184 non-null  int64  
 12  crucialness               107184 non-null  float64
 13  perc_hh_head_male         107184 non-null  float64
 14  perc_pop1318_secondary    107184 non-null  float64
 15  perc_pop017_certificate   107184 non-null  float64
 16  perc_pop017_both_parents  107184 non-null  float64
 17  perc_pop2p_disability     107184 non-null  float64
 18  perc_pop1017_married      107184 non-null  float64
 19  perc_pop1217_birth        107184 non-null  float64
 20  perc_pop1464_working      107184 non-null  float64
 21  perc_hh_temp_dwelling     107184 non-null  float64
 22  perc_hh_mosquito_net      107184 non-null  float64
 23  perc_hh_toilet            107184 non-null  float64
 24  perc_hh_own_house         107184 non-null  float64
 25  perc_hh_bank_acc          107184 non-null  float64
 26  perc_hh_electricity       107184 non-null  float64
 27  total_events_adm4         107184 non-null  float64
 28  perc_local_served         107184 non-null  float64
 29  is_central                107184 non-null  int64  
 30  is_eastern                107184 non-null  int64  
 31  is_western                107184 non-null  int64  
dtypes: float64(25), int64(7)
memory usage: 27.0 MB
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X = modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']

#check
print(X.shape)
print(y.shape)
(107184, 31)
(107184,)

Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y). y should have a single column as it is the outcome variable.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

#compare resampled datasets
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
Test set has 19.7% non-functioning water points and 80.3% functioning
Original train set has 19.5% non-functioning water points and 80.5% functioning
Resampled train set has 50.0% non-functioning water points and 50.0% functioning

We over-sample the minority class (non-functioning water points) to get an equal distribution of our outcome variable. Note that this should be done on the train set only, never the test set: the test set must keep its original class distribution so that our evaluation reflects reality.
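As a quick sanity check (illustrative, not part of the original notebook), SMOTE synthesises new minority-class rows rather than dropping majority ones, so the resampled training set should be larger:

from collections import Counter

#resampled train set gains rows, with both classes at the majority count
print(X_train.shape, X_train_res.shape)
print(Counter(y_train), Counter(y_train_res))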

X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)

We scale our data so that features with larger scales cannot dominate the SVM's distance-based decision boundary and bias our estimates.
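The scaling() helper comes from the packages file; a minimal sketch of what it presumably does (an assumed implementation, the key point being that the scaler is fitted on the training set only) is:

#assumed sketch of scaling(), not the actual helper
def scaling(scaler, X_train, X_test):
    X_train_scaled = scaler.fit_transform(X_train)  #learn mean/std from train only
    X_test_scaled = scaler.transform(X_test)  #reuse train statistics, no leakage
    return X_train_scaled, X_test_scaled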

Running baseline model

start=time.time()

#instantiate and fit
SVC_base = LinearSVC(max_iter=500, random_state=rand_seed).fit(X_train_res_scaled, y_train_res)

end=time.time()

time_fit_base=end-start

print(f"Time to fit the model on the training set is {round(time_fit_base, 3)} seconds")
print(f"Accuracy score for train set is is: {SVC_base.score(X_train_res_scaled, y_train_res)}")
print(f"Accuracy score for test set is is:{SVC_base.score(X_test_scaled, y_test)}")
Time to fit the model on the training set is 24.858 seconds
Accuracy score for train set is: 0.6506798677954309
Accuracy score for test set is: 0.6219620282688809

Accuracy for our baseline model is low, at 62%. SVMs are maximum-margin classifiers: they pick the decision boundary that is as far as possible from both classes, which should make the boundary more generalisable and hopefully better on unseen data.
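We can see the margin at work by inspecting the raw decision function (an illustrative check using the fitted SVC_base, not part of the original pipeline):

#decision_function returns a signed score per observation: the sign gives the
#predicted class, the magnitude the (unnormalised) distance from the hyperplane
print(SVC_base.decision_function(X_test_scaled[:5]))
print(SVC_base.predict(X_test_scaled[:5]))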

SVC_base_clf = LinearSVC(max_iter=500, random_state=rand_seed)

#get probabilities
CLF_base = CalibratedClassifierCV(SVC_base_clf).fit(X_train_res_scaled, y_train_res)

We convert the SVM's decision scores into probabilities using CalibratedClassifierCV, which uses cross-validation to estimate each observation's probability of being a 1 (functioning water point).
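For example (illustrative only), the calibrated wrapper exposes predict_proba, which LinearSVC itself lacks:

#each row gives [P(non-functioning), P(functioning)] for one water point
print(CLF_base.predict_proba(X_test_scaled[:5]))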

fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(CLF_base, X_train_res_scaled, y_train_res)

#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(CLF_base, X_train_res_scaled, y_train_res)
ROC AUC: 0.7104575976682433
PR AUC: 0.7071331417357499
              precision    recall  f1-score   support

           0       0.65      0.65      0.65     68984
           1       0.65      0.66      0.65     68984

    accuracy                           0.65    137968
   macro avg       0.65      0.65      0.65    137968
weighted avg       0.65      0.65      0.65    137968

This conversion enables us to calculate the confusion matrix and accuracy metrics for an SVM model. Here the resampled training set is split roughly evenly between predicted functioning and non-functioning. As the resampled data is perfectly balanced, there are roughly equal proportions of TP/TN and FP/FN.
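The confusion matrix behind the report above can also be computed directly (a quick check, assuming the fitted CLF_base):

from sklearn.metrics import confusion_matrix

#rows are actual classes, columns predictions: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_train_res, CLF_base.predict(X_train_res_scaled)))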

fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(CLF_base, X_test_scaled, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")

#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(CLF_base, X_test_scaled, y_test)
ROC AUC: 0.6407327354641204
PR AUC: 0.870834632919976
              precision    recall  f1-score   support

           0       0.28      0.55      0.37      4221
           1       0.85      0.65      0.74     17216

    accuracy                           0.63     21437
   macro avg       0.57      0.60      0.55     21437
weighted avg       0.74      0.63      0.67     21437

Time to predict the outcome variable for the test set is 0.019 seconds

The model performs badly on the test set. Precision for the non-functioning class is only 0.28, and only about two thirds of functioning water points are correctly identified (recall of 0.65 for class 1).

Narrowing down parameters

# set range of C values (the inverse regularisation strength)
c_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
accuracy_rows = []

for c in c_range:

    #instantiate and fit
    linear_svc = LinearSVC(C=c, max_iter=500, random_state=rand_seed).fit(
        X_train_res_scaled, y_train_res)

    # store accuracy scores
    train_score = linear_svc.score(X_train_res_scaled, y_train_res)
    test_score = linear_svc.score(X_test_scaled, y_test)

    # collect one row per C value
    accuracy_rows.append(
        {'Penalty': c, 'Train_score': train_score, 'Test_score': test_score})

# build the results dataframe in one go
accuracy_scores = pd.DataFrame(accuracy_rows)

# visualise relationship between penalty and accuracy
plt.figure()

plt.plot(accuracy_scores['Penalty'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Penalty'],
         accuracy_scores['Test_score'], label='test score', marker='.')

plt.xscale('log')
plt.xlabel('Penalty')
plt.ylabel("Accuracy")

plt.title("Increasing the penalty reduces accuracy score dramatically")
plt.legend(loc='best')
plt.grid()

plt.show()

Accuracy on both the train and test sets falls as C grows, with an especially big drop once C exceeds 1. Note that in scikit-learn C is the inverse of the regularisation strength: a larger C means a weaker penalty and a tighter fit to the training data, so the drop at large C is more likely a convergence problem (we cap max_iter at 500) than over-regularisation.
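One way to test the convergence hypothesis (a diagnostic sketch, not part of the original analysis) is to check whether the largest-C fit raises a ConvergenceWarning within max_iter=500:

import warnings
from sklearn.exceptions import ConvergenceWarning

#refit the largest-C model and record any convergence warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", category=ConvergenceWarning)
    LinearSVC(C=1000, max_iter=500, random_state=rand_seed).fit(
        X_train_res_scaled, y_train_res)
print(any(issubclass(w.category, ConvergenceWarning) for w in caught))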

Finding optimal hyperparameters

We run a grid search cross validation to attempt to find the best combination of hyperparameters for our model.

estimator = [('scaler', StandardScaler()), ('SVC', LinearSVC(max_iter=500))]

# defining the grid of parameters we want to compare
param = {"SVC__C": c_range,
         "SVC__penalty": ['l1', 'l2']}

# run cross validation
pipeline_cross_val_grid(estimator, param, X_train_res, y_train_res, X_test, y_test)
The model with the best CV score has the following parameters: {'SVC__C': 0.001, 'SVC__penalty': 'l2'}.
The best model has an accuracy score of 0.6236880160470215 on the test set
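pipeline_cross_val_grid() is defined in the packages file; presumably it wraps the steps in a Pipeline and runs GridSearchCV, roughly along these lines (the cv value is an assumption; note that LinearSVC only supports the l1 penalty with dual=False, so those candidates would fail to fit and be scored as NaN):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#scaling happens inside each CV fold, which avoids leakage across folds
grid = GridSearchCV(Pipeline(estimator), param_grid=param, cv=5, n_jobs=-1)
grid.fit(X_train_res, y_train_res)

print(f"The model with the best CV score has the following parameters: {grid.best_params_}.")
print(f"The best model has an accuracy score of {grid.best_estimator_.score(X_test, y_test)} on the test set")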

The optimal model has C=0.001 with the l2 (Ridge) penalty: the squared coefficients are added to the cost function, as opposed to their absolute values (l1, Lasso). LinearSVC minimises roughly ||w||^2 / 2 + C * sum_i max(0, 1 - y_i(w.x_i + b))^2, so C is the inverse of the regularisation strength: C=0.001, the smallest value in our grid, means strong regularisation, with the margin term dominating the data-fit term.

Running optimised model

start=time.time()

#instantiate and fit
SVC_opt = LinearSVC(max_iter=500, C=0.001, penalty='l2', random_state=rand_seed).fit(X_train_res_scaled, y_train_res)

end=time.time()

time_fit_opt=end-start

print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
print(f"Accuracy score for train set is: {SVC_opt.score(X_train_res_scaled, y_train_res)}")
print(f"Accuracy score for test set is: {SVC_opt.score(X_test_scaled, y_test)}")
Time to fit the model on the training set is 1.171 seconds
Accuracy score for train set is: 0.6506073872202249
Accuracy score for test set is: 0.6236880160470215

The accuracy score on the test set increases very slightly. Let's see if it improves any of our precision/recall scores. The fitting time is substantially reduced (over 20x shorter: about 1.2 vs 24.9 seconds).

SVC_opt_clf = LinearSVC(max_iter=500, C=0.001, penalty='l2', random_state=rand_seed)

#get probabilities
CLF_opt = CalibratedClassifierCV(SVC_opt_clf).fit(X_train_res_scaled, y_train_res)
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(CLF_opt, X_train_res_scaled, y_train_res)

#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(CLF_opt, X_train_res_scaled, y_train_res)
ROC AUC: 0.7103580646408449
PR AUC: 0.7070612637150856
              precision    recall  f1-score   support

           0       0.65      0.65      0.65     68984
           1       0.65      0.66      0.65     68984

    accuracy                           0.65    137968
   macro avg       0.65      0.65      0.65    137968
weighted avg       0.65      0.65      0.65    137968

fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(CLF_opt, X_test_scaled, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")

#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(CLF_opt, X_test_scaled, y_test)
ROC AUC: 0.6401459218996186
PR AUC: 0.8706500211846758
              precision    recall  f1-score   support

           0       0.28      0.55      0.37      4221
           1       0.85      0.65      0.74     17216

    accuracy                           0.63     21437
   macro avg       0.57      0.60      0.55     21437
weighted avg       0.74      0.63      0.67     21437

Time to predict the outcome variable for the test set is 0.007 seconds

The optimised model performs virtually identically to the baseline on both the training and test sets, and overall accuracy remains very low. It seems that our data is not well suited to a linear SVC.

Comparing results

plot_curve_roc('SVC', fpr_train_base, tpr_train_base, roc_auc_train_base,
               fpr_train_opt, tpr_train_opt, roc_auc_train_opt,
               fpr_test_base, tpr_test_base, roc_auc_test_base,
               fpr_test_opt, tpr_test_opt, roc_auc_test_opt)

As seen above, the performance of the baseline and optimised models is the same. We go with the baseline model in the hope that it is more generalisable to other datasets.
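plot_curve_roc() is also a packages-file helper; a rough equivalent for the test-set curves (a sketch, not the actual helper) would be:

#plot baseline and optimised test ROC curves against the chance line
plt.figure()
plt.plot(fpr_test_base, tpr_test_base, label=f'test, baseline (AUC={roc_auc_test_base:.2f})')
plt.plot(fpr_test_opt, tpr_test_opt, label=f'test, optimised (AUC={roc_auc_test_opt:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()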

Exporting

joblib.dump(SVC_base, model_filepath+'support_vector_machine_model.sav')
['/Users/thomasadler/Desktop/futuristic-platipus/models/support_vector_machine_model.sav']
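The dumped model can later be reloaded for the cross-model comparison (a usage sketch):

#reload the saved model and confirm it scores the same on the test set
SVC_reloaded = joblib.load(model_filepath + 'support_vector_machine_model.sav')
print(SVC_reloaded.score(X_test_scaled, y_test))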
d = {'Model': ['Linear SVC'], 'Parameters': ['Penalty=l2, C=0.001, Standard Scaler'],
     'Accuracy Train': [accuracy_train_base], 'Precision Train': [precision_train_base],
     'Recall Train': [recall_train_base], 'F1 Train': [f1_train_base],
     'ROC AUC Train': [roc_auc_train_base], 'Accuracy Test': [accuracy_test_base],
     'Precision Test': [precision_test_base], 'Recall Test': [recall_test_base],
     'F1 Test': [f1_test_base], 'ROC AUC Test': [roc_auc_test_base],
     'Time Fit': [time_fit_base], 'Time Predict': [time_predict_test_base],
     'Precision Non-functioning Test': [0.28], 'Recall Non-functioning Test': [0.55],
     'F1 Non-functioning Test': [0.37], 'Precision Functioning Test': [0.85],
     'Recall Functioning Test': [0.65], 'F1 Functioning Test': [0.74]}

#to dataframe
best_model_result_df=pd.DataFrame(data=d)

#check
best_model_result_df
Model: Linear SVC | Parameters: Penalty=l2, C=0.001, Standard Scaler
Accuracy Train: 0.6521 | Precision Train: 0.65212 | Recall Train: 0.6521 | F1 Train: 0.652089 | ROC AUC Train: 0.710458
Accuracy Test: 0.631198 | Precision Test: 0.74105 | Recall Test: 0.631198 | F1 Test: 0.666512 | ROC AUC Test: 0.640733
Time Fit: 1.170683 | Time Predict: 0.019465
Precision Non-functioning Test: 0.28 | Recall Non-functioning Test: 0.55 | F1 Non-functioning Test: 0.37
Precision Functioning Test: 0.85 | Recall Functioning Test: 0.65 | F1 Functioning Test: 0.74
best_model_result_df.to_csv(model_filepath + 'support_vector_machine_model.csv')
metrics=[fpr_train_base, tpr_train_base, fpr_test_base, tpr_test_base]
metrics_name=['fpr_train_base', 'tpr_train_base', 'fpr_test_base', 'tpr_test_base']

#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'support_vector_machine_{metric_name}', metric)
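In the comparison notebook these arrays can then be reloaded with np.load (a sketch; np.save appends the .npy extension automatically):

#example: reload one of the saved ROC arrays
fpr_train_base = np.load(model_filepath + 'support_vector_machine_fpr_train_base.npy')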