11. Support Vector Machine
- Support Vector Machine
- Preparing data
- Running baseline model
- Narrowing down parameters
- Finding optimal hyperparameters
- Running optimised model
- Comparing results
- Exporting
We run our sixth ML model, an SVM: first with default parameters, then with tuned hyperparameters to try to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for further comparison.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X = modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']
#check
print(X.shape)
print(y.shape)
Our feature matrix (X) should have the same number of rows (107,184) as our dependent variable (y). y should only have one column as it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note that this should only be done on the training set, not the test set, which must remain untouched so it reflects the true class distribution.
X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)
We need to scale our data to prevent features with bigger scales from biasing our estimates.
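The scaling() helper comes from the shared packages notebook; a minimal sketch of what such a helper presumably does (fit the scaler on the resampled training data only, then reuse those statistics on the test set) is shown below. The function name and body here are assumptions for illustration, not the actual helper.
# hypothetical sketch of the scaling() helper from ta_01_packages_functions.py
def scaling_sketch(scaler, X_train, X_test):
    scaler.fit(X_train)                       # learn means/standard deviations from the train set only
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)  # reuse the train statistics to avoid data leakage
    return X_train_scaled, X_test_scaled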
start=time.time()
#instantiate and fit
SVC_base = LinearSVC(max_iter=500, random_state=rand_seed).fit(X_train_res_scaled, y_train_res)
end=time.time()
time_fit_base=end-start
print(f"Time to fit the model on the training set is {round(time_fit_base, 3)} seconds")
print(f"Accuracy score for train set is is: {SVC_base.score(X_train_res_scaled, y_train_res)}")
print(f"Accuracy score for test set is is:{SVC_base.score(X_test_scaled, y_test)}")
Accuracy score for our baseline model is low, at 62%. SVMs are maximum-margin classifiers: they try to pick a decision boundary that is as far apart from the two classes as possible. This makes the decision boundary more generalisable and hopefully better on unseen data.
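For reference (not from the shared notebook), the soft-margin objective that a linear SVM minimises can be written as:
$$\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^2 + C\sum_{i}\xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0$$
The first term rewards a wide margin, while C controls how heavily margin violations are penalised.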
SVC_base_clf = LinearSVC(max_iter=500, random_state=rand_seed)
#get probabilities
CLF_base = CalibratedClassifierCV(SVC_base_clf).fit(X_train_res_scaled, y_train_res)
We convert the SVM's decision function into probabilities using CalibratedClassifierCV. It uses cross-validation to estimate the probability of each observation being a 1 (functioning water point).
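As a quick illustration (reusing the variables defined above), the calibrated probabilities for the positive class can be pulled out directly:
# probability that each test observation is a functioning water point (class 1)
proba_functioning = CLF_base.predict_proba(X_test_scaled)[:, 1]
print(proba_functioning[:5])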
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(CLF_base, X_train_res_scaled, y_train_res)
#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(CLF_base, X_train_res_scaled, y_train_res)
This conversion enables us to calculate the confusion matrix and accuracy metrics for an SVM model. Here the training set is split roughly equally between predicted functioning and non-functioning. As our resampled dataset is perfectly balanced, there is an equal proportion of TP/TN and FP/FN.
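print_report and get_scores are helpers from the shared packages notebook; the same confusion matrix could also be reproduced directly with scikit-learn, for example:
# rows = actual class, columns = predicted class (import repeated here in case the packages script does not load it)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train_res, CLF_base.predict(X_train_res_scaled)))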
fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(CLF_base, X_test_scaled, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")
#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(CLF_base, X_test_scaled, y_test)
The model performs poorly on the test set: the way our model classifies water points is clearly not accurate, with only about two thirds of functioning water points correctly identified.
# set range of penalties
c_range = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
accuracy_rows = []
for c in c_range:
    # instantiate and fit
    SVC = LinearSVC(C=c, max_iter=500, random_state=rand_seed).fit(
        X_train_res_scaled, y_train_res)
    # store accuracy scores
    train_score = SVC.score(X_train_res_scaled, y_train_res)
    test_score = SVC.score(X_test_scaled, y_test)
    # append to list
    accuracy_rows.append(
        {'Penalty': c, 'Train_score': train_score, 'Test_score': test_score})
# collect results into a dataframe
accuracy_scores = pd.DataFrame(accuracy_rows)
# visualise relationship between penalty and accuracy
plt.figure()
plt.plot(accuracy_scores['Penalty'],
accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Penalty'],
accuracy_scores['Test_score'], label='test score', marker='.')
plt.xscale('log')
plt.xlabel('Penalty')
plt.ylabel("Accuracy")
plt.title("Increasing the penalty reduces accuracy score dramatically")
plt.legend(loc='best')
plt.grid()
plt.show()
We see that increasing the penalty parameter C (which penalises misclassification more heavily and therefore reduces the amount of regularisation, fitting the training data more tightly) hurts the accuracy scores for both the train and test sets. There is an especially big drop when C gets larger than 1.
We run a grid search cross validation to attempt to find the best combination of hyperparameters for our model.
estimator = [('scaler', StandardScaler()), ('SVC', LinearSVC(max_iter=500))]
# defining distribution of parameters we want to compare
param = {"SVC__C": c_range,
"SVC__penalty":['l1', 'l2']}
# run cross validation
pipeline_cross_val_grid(estimator, param, X_train_res, y_train_res, X_test, y_test)
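pipeline_cross_val_grid is defined in the shared packages notebook; a minimal sketch of what it presumably does (build a Pipeline from the steps above and grid-search the parameter dictionary with cross-validation) could look like the following. The function name, cv value and scoring choice here are assumptions for illustration.
# hypothetical sketch of pipeline_cross_val_grid() - the real helper lives in ta_01_packages_functions.py
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
def pipeline_cross_val_grid_sketch(steps, param_grid, X_train, y_train, X_test, y_test):
    pipe = Pipeline(steps)                                    # e.g. scaler followed by the SVC
    grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train, y_train)                                # cross-validate on the resampled train set
    print(f"Best parameters: {grid.best_params_}")
    print(f"Test accuracy of best model: {grid.best_estimator_.score(X_test, y_test)}")
    return grid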
The optimal model has C=0.001 and uses Ridge (L2) regularisation. The penalty added to the cost function is the sum of the squared coefficients (Ridge), as opposed to the sum of their absolute values (Lasso). Note that in LinearSVC, C is the inverse of the regularisation strength, so such a small C actually corresponds to fairly heavy shrinkage of the coefficients.
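Concretely, the two regularisation terms differ only in how the coefficients enter the penalty:
$$\text{L2 (Ridge) penalty: } \sum_j w_j^2 \qquad\qquad \text{L1 (Lasso) penalty: } \sum_j \lvert w_j \rvert$$
In LinearSVC the relative weight of this penalty against the loss term is effectively 1/C, which is why C = 0.001 amounts to strong regularisation.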
start=time.time()
#instantiate and fit
SVC_opt = LinearSVC(max_iter=500, C=0.001, penalty='l2', random_state=rand_seed).fit(X_train_res_scaled, y_train_res)
end=time.time()
time_fit_opt=end-start
print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
print(f"Accuracy score for train set is: {SVC_opt.score(X_train_res_scaled, y_train_res)}")
print(f"Accuracy score for test set is: {SVC_opt.score(X_test_scaled, y_test)}")
The accuracy score on the test set increases very slightly. Let's see if it improves any of our precision/recall scores. The fitting time is substantially reduced (15x shorter!).
SVC_opt_clf = LinearSVC(max_iter=500, C=0.001, penalty='l2', random_state=rand_seed)
#get probabilities
CLF_opt = CalibratedClassifierCV(SVC_opt_clf).fit(X_train_res_scaled, y_train_res)
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(CLF_opt, X_train_res_scaled, y_train_res)
#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(CLF_opt, X_train_res_scaled, y_train_res)
fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(CLF_opt, X_test_scaled, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")
#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(CLF_opt, X_test_scaled, y_test)
The baseline and optimised models perform exactly the same on both the training and test sets. It seems that our data is not well suited to a linear SVC; overall accuracy scores are very low.
plot_curve_roc('SVC', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
tpr_test_base, roc_auc_test_base, fpr_test_opt, tpr_test_opt, roc_auc_test_opt)
As seen above, the performance of the baseline and optimised models is the same. We go with the baseline one in the hope that it is more generalisable to other datasets.
joblib.dump(SVC_base, model_filepath+'support_vector_machine_model.sav')
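The saved model can later be reloaded for comparison against the other notebooks' models, for example:
# example of reloading the dumped model in the comparison notebook
SVC_loaded = joblib.load(model_filepath + 'support_vector_machine_model.sav')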
d = {'Model':['Linear SVC'], 'Parameters':['Penalty=l2 (default), C=1.0 (default), Standard Scaler'], 'Accuracy Train': [accuracy_train_base],\
'Precision Train': [precision_train_base], 'Recall Train': [recall_train_base], 'F1 Train': [f1_train_base], 'ROC AUC Train':[roc_auc_train_base],\
'Accuracy Test': [accuracy_test_base], 'Precision Test': [precision_test_base], 'Recall Test': [recall_test_base], 'F1 Test': [f1_test_base],\
'ROC AUC Test':[roc_auc_test_base],'Time Fit': [time_fit_base],\
'Time Predict': [time_predict_test_base], "Precision Non-functioning Test":0.28, "Recall Non-functioning Test":0.55,\
"F1 Non-functioning Test":0.37,"Precision Functioning Test":0.85, "Recall Functioning Test":0.65,"F1 Functioning Test":0.74}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'support_vector_machine_model.csv')
metrics=[fpr_train_base, tpr_train_base, fpr_test_base, tpr_test_base]
metrics_name=['fpr_train_base', 'tpr_train_base', 'fpr_test_base', 'tpr_test_base']
#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'support_vector_machine_{metric_name}', metric)
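These arrays can then be reloaded in the model-comparison notebook, for example (np.save appends the .npy extension automatically):
# example of reloading one of the saved arrays for the ROC comparison plot
fpr_test_base = np.load(model_filepath + 'support_vector_machine_fpr_test_base.npy')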