7. K Nearest Neighbours
- K Nearest Neighbors
- Preparing data
- Running baseline model
- Narrowing down parameters
- Finding optimal hyperparameters
- Running optimised model
- Comparing results
- Exporting
We run our second ML model, a K nearest neighbours classifier, first with default parameters, and then attempt to tune its hyperparameters to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for later comparison.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
modelling_df_sample = modelling_df.sample(n=round(len(modelling_df) * 0.3),
random_state=rand_seed)
#check class balance of the sample
modelling_df_sample['is_functioning'].value_counts()
We take a subsample of our dataset because KNN models take a very long time to fit, train and predict. We make sure that both classes of our outcome variable are still present in the subsample. We then go through the same process with our subsample as if it were our full dataset.
X =modelling_df_sample.loc[:, modelling_df_sample.columns != 'is_functioning']
y = modelling_df_sample['is_functioning']
#check
print(X.shape)
print(y.shape)
Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y). y should only have one column as it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare class balance of the resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note this should be done on the train set and not the test set as we should not tinker with the latter.
X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)
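The scaling function is defined in ta_01_packages_functions.py and is not reproduced here; a minimal sketch of what we assume it does (fit the scaler on the resampled training set only, then transform both sets) is below. The function body is an assumption, not the actual implementation.
import pandas as pd

def scaling(scaler, X_train, X_test):
    # fit on the training data only, to avoid leaking test-set statistics
    scaler.fit(X_train)
    # transform both sets and keep the original column names
    X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    return X_train_scaled, X_test_scaled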
start=time.time()
#instantiate and fit
KNN_base = KNeighborsClassifier().fit(X_train_res_scaled, y_train_res)
end=time.time()
time_fit_base=end-start
print(f"Time to fit the model on the training set is {round(time_fit_base,3)} seconds")
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(KNN_base, X_train_res_scaled, y_train_res)
#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(KNN_base, X_train_res_scaled, y_train_res)
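print_report and get_scores are also helpers from the packages notebook. Based on the values they return here, we assume print_report times the predictions, prints a classification report and confusion matrix, and returns the ROC and precision-recall inputs used for plotting, while get_scores returns the four headline metrics. A rough sketch under those assumptions:
import time
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix,
                             roc_curve, auc, precision_recall_curve)

def print_report(model, X, y):
    start = time.time()
    y_pred = model.predict(X)
    time_predict = time.time() - start
    y_proba = model.predict_proba(X)[:, 1]  # probability of the positive (functioning) class
    print(classification_report(y, y_pred))
    print(confusion_matrix(y, y_pred))
    fpr, tpr, _ = roc_curve(y, y_proba)
    roc_auc = auc(fpr, tpr)
    precision, recall, _ = precision_recall_curve(y, y_proba)
    pr_auc = auc(recall, precision)
    return fpr, tpr, roc_auc, precision, recall, pr_auc, time_predict

def get_scores(model, X, y):
    y_pred = model.predict(X)
    return (accuracy_score(y, y_pred), precision_score(y, y_pred),
            recall_score(y, y_pred), f1_score(y, y_pred))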
Our first KNN has a relatively good accuracy score of 89% on the resampled training set. Errors mostly come from not recognising functioning water points, as shown by the comparatively low recall score of 85% for functioning water points.
fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(KNN_base, X_test_scaled, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")
#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(KNN_base, X_test_scaled, y_test)
Our test set has an accuracy score of 74%. As with the training set, the model misses quite a lot of functioning water points (mislabelling around a fifth of them) and also non-functioning points (mislabelling around a third of them).
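If we want to double-check those fractions, the row-normalised confusion matrix gives, for each true class, the share of points labelled correctly (the off-diagonal entries are the shares mislabelled). A quick check using objects already defined above:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, KNN_base.predict(X_test_scaled), normalize='true')
# cm[i, i] is the recall of class i; 1 - cm[i, i] is the share of class i that is mislabelled
print(cm)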
# set range of neighbors
k_range = [2, 5, 15, 50, 100, 500]
accuracy_rows = []
for k in k_range:
    #instantiate and fit
    KNN = KNeighborsClassifier(n_neighbors=k).fit(
        X_train_res_scaled, y_train_res)
    # store accuracy scores
    train_score = KNN.score(X_train_res_scaled, y_train_res)
    test_score = KNN.score(X_test_scaled, y_test)
    # collect scores (DataFrame.append is deprecated, so build the dataframe afterwards)
    accuracy_rows.append(
        {'Neighbors': k, 'Train_score': train_score, 'Test_score': test_score})
accuracy_scores = pd.DataFrame(accuracy_rows)
# visualise relationship between neighbors and accuracy
plt.figure()
plt.plot(accuracy_scores['Neighbors'],
accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Neighbors'],
accuracy_scores['Test_score'], label='test score', marker='.')
plt.xlabel('Neighbors')
plt.ylabel("Accuracy")
plt.title("More neighbors decreases accuracy")
plt.legend(loc='best')
plt.grid()
plt.show()
As the number of neighbors we take into account increases, the accuracy of our model decreases. We need a good middle ground where the gap between the train and test scores is not too large (as it is when k=2) and the test accuracy is still relatively high (which it is not when k=500). This middle ground seems to be when k is lower than 100 neighbors.
We run a randomised cross validation through a pipeline to find the optimal hyperparameters. We choose a randomised search as opposed to a grid search because KNN models are computationally expensive to evaluate.
estimator = [('scaling', StandardScaler()),
('reduce_dim', PCA()),
('KNN', KNeighborsClassifier())]
# defining distribution of parameters we want to compare
param_dist = {"KNN__n_neighbors": range(1, 50, 1),
'reduce_dim__n_components': [0.5, 0.6, 0.7, 0.8, 0.9, None]}
# run cross validation
pipeline_cross_val_random(estimator, param_dist, X_train_res, y_train_res, X_test, y_test)
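pipeline_cross_val_random is another helper from the packages notebook. We assume it wraps the steps in a scikit-learn Pipeline and runs a RandomizedSearchCV over the parameter distribution; the n_iter and cv values in this sketch are assumptions rather than the actual settings.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

def pipeline_cross_val_random(estimator, param_dist, X_train, y_train, X_test, y_test):
    # chain scaling, PCA and KNN so every candidate is fitted on the raw resampled train set
    pipe = Pipeline(estimator)
    search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                                n_iter=10, cv=5, random_state=rand_seed, n_jobs=-1)
    search.fit(X_train, y_train)
    print(f"Best parameters: {search.best_params_}")
    print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
    print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
    return search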
The best model here seems to have a very similar accuracy score, but hopefully it scores better on precision and recall. The optimal number of neighbors is 4, just below the default of 5 used in our baseline model. The optimal model also reduces dimensions (using PCA) to the minimum number of components that can explain 60% of the variance in the dataset.
X_train_res_scaled_PCA, X_test_scaled_PCA=run_PCA(0.6, X_train_res_scaled, X_test_scaled)
Note that the PCA transformer should be fitted on the training set only and then used to transform both the training and test sets. We end up with 10 components that explain 60% of the variance in our dataset.
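run_PCA is a helper from the packages notebook as well; a minimal sketch of the assumed behaviour (fit PCA on the scaled training set, transform both sets) is below. The use of rand_seed inside the helper is an assumption.
from sklearn.decomposition import PCA

def run_PCA(n_components, X_train_scaled, X_test_scaled):
    # a float n_components keeps the smallest number of components explaining
    # at least that share of the variance (here 60%)
    pca = PCA(n_components=n_components, random_state=rand_seed)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    print(f"{pca.n_components_} components explain "
          f"{pca.explained_variance_ratio_.sum():.0%} of the variance")
    return X_train_pca, X_test_pca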
start=time.time()
#instantiate and fit
KNN_opt = KNeighborsClassifier(n_neighbors=4).fit(X_train_res_scaled_PCA, y_train_res)
end=time.time()
time_fit_opt=end-start
print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
Fitting the model with our optimal parameters takes roughly 10x longer than fitting the baseline model.
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(KNN_opt, X_train_res_scaled_PCA, y_train_res)
#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(KNN_opt, X_train_res_scaled_PCA, y_train_res)
With PCA transformation, we end up with many more false negatives and fewer false positives. The model seems to over-identify non-functioning water points.
fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(KNN_opt, X_test_scaled_PCA, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")
#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(KNN_opt, X_test_scaled_PCA, y_test)
On the whole, the optimised model actually performs worse than the baseline model: its accuracy score is lower by 6 percentage points and almost all other metrics decrease, apart from the precision score for functioning points. However, the optimised model does have a higher recall score, and this is the metric we are most interested in, so we go with the optimised model. It also took much less time to predict the outcome variable for the test set (under 1 second versus over 4 seconds).
plot_curve_roc('KNN', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
tpr_test_base, roc_auc_test_base, fpr_test_opt, tpr_test_opt, roc_auc_test_opt)
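plot_curve_roc is a plotting helper from the packages notebook. Judging by the arguments passed above, we assume it overlays the baseline and optimised ROC curves for the train and test sets; a minimal sketch under that assumption:
import matplotlib.pyplot as plt

def plot_curve_roc(model_name,
                   fpr_train_base, tpr_train_base, auc_train_base,
                   fpr_train_opt, tpr_train_opt, auc_train_opt,
                   fpr_test_base, tpr_test_base, auc_test_base,
                   fpr_test_opt, tpr_test_opt, auc_test_opt):
    plt.figure()
    # one curve per model/set combination, with the AUC in the legend
    plt.plot(fpr_train_base, tpr_train_base, label=f'train baseline (AUC={auc_train_base:.2f})')
    plt.plot(fpr_train_opt, tpr_train_opt, label=f'train optimised (AUC={auc_train_opt:.2f})')
    plt.plot(fpr_test_base, tpr_test_base, label=f'test baseline (AUC={auc_test_base:.2f})')
    plt.plot(fpr_test_opt, tpr_test_opt, label=f'test optimised (AUC={auc_test_opt:.2f})')
    plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title(f'{model_name} ROC curves')
    plt.legend(loc='best')
    plt.show()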
The ROC curves show the baseline model performing best on the test set, and PCA makes our results less interpretable without improving overall accuracy. However, because the optimised model has the better recall and a much faster prediction time, it is the one we carry forward and export below.
joblib.dump(KNN_opt, model_filepath+'k_nearest_neighbors_model.sav')
d = {'Model':['K Nearest Neighbors'], 'Parameters':['Neighbors=4, Standard Scaler, PCA (60% variance)'], 'Accuracy Train': [accuracy_train_opt],\
'Precision Train': [precision_train_opt], 'Recall Train': [recall_train_opt], 'F1 Train': [f1_train_opt], 'ROC AUC Train':[roc_auc_train_opt],\
'Accuracy Test': accuracy_test_opt, 'Precision Test': [precision_test_opt], 'Recall Test': [recall_test_opt], 'F1 Test': [f1_test_opt],\
'ROC AUC Test':[roc_auc_test_opt], 'Time Fit': time_fit_opt,\
'Time Predict': time_predict_test_opt, "Precision Non-functioning Test":0.33, "Recall Non-functioning Test":0.65,\
"F1 Non-functioning Test":0.44,"Precision Functioning Test":0.89, "Recall Functioning Test":0.69,"F1 Functioning Test":0.77}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'k_nearest_neighbors_model.csv')
metrics=[fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name=['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']
#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'k_nearest_neighbors_{metric_name}', metric)