We run our second ML model, a K Nearest Neighbors classifier, first with default parameters, and then tune its hyperparameters in an attempt to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for later comparison.

K Nearest Neighbors

%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py

Preparing data

modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)

#check
modelling_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 107184 entries, 0 to 108905
Data columns (total 32 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   lat_deg                   107184 non-null  float64
 1   lon_deg                   107184 non-null  float64
 2   is_functioning            107184 non-null  int64  
 3   distance_to_primary       107184 non-null  float64
 4   distance_to_secondary     107184 non-null  float64
 5   distance_to_tertiary      107184 non-null  float64
 6   distance_to_city          107184 non-null  float64
 7   distance_to_town          107184 non-null  float64
 8   usage_cap                 107184 non-null  float64
 9   is_complex_tech           107184 non-null  int64  
 10  is_installed_after_2006   107184 non-null  int64  
 11  is_public_management      107184 non-null  int64  
 12  crucialness               107184 non-null  float64
 13  perc_hh_head_male         107184 non-null  float64
 14  perc_pop1318_secondary    107184 non-null  float64
 15  perc_pop017_certificate   107184 non-null  float64
 16  perc_pop017_both_parents  107184 non-null  float64
 17  perc_pop2p_disability     107184 non-null  float64
 18  perc_pop1017_married      107184 non-null  float64
 19  perc_pop1217_birth        107184 non-null  float64
 20  perc_pop1464_working      107184 non-null  float64
 21  perc_hh_temp_dwelling     107184 non-null  float64
 22  perc_hh_mosquito_net      107184 non-null  float64
 23  perc_hh_toilet            107184 non-null  float64
 24  perc_hh_own_house         107184 non-null  float64
 25  perc_hh_bank_acc          107184 non-null  float64
 26  perc_hh_electricity       107184 non-null  float64
 27  total_events_adm4         107184 non-null  float64
 28  perc_local_served         107184 non-null  float64
 29  is_central                107184 non-null  int64  
 30  is_eastern                107184 non-null  int64  
 31  is_western                107184 non-null  int64  
dtypes: float64(25), int64(7)
memory usage: 27.0 MB
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
modelling_df_sample = modelling_df.sample(n=round(len(modelling_df) * 0.3),
                                        random_state=rand_seed)

#class balance of sample
modelling_df_sample['is_functioning'].value_counts()
1    25996
0     6159
Name: is_functioning, dtype: int64

We take a subsample of our dataset because KNN models take a very long time to fit and predict. We make sure that both classes of our outcome variable are still present in our subsample. We then go through the same process with our subsample as if it were our full dataset.
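If we wanted to guarantee that the subsample keeps exactly the same class balance as the full dataset, a stratified draw would be an alternative; a minimal sketch using train_test_split (the stratified_sample name is just illustrative):

from sklearn.model_selection import train_test_split

# keep 30% of the rows, preserving the is_functioning class ratio exactly
stratified_sample, _ = train_test_split(modelling_df, train_size=0.3,
                                        stratify=modelling_df['is_functioning'],
                                        random_state=rand_seed)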

X =modelling_df_sample.loc[:, modelling_df_sample.columns != 'is_functioning']
y = modelling_df_sample['is_functioning']

#check
print(X.shape)
print(y.shape)
(32155, 31)
(32155,)

Our independent variables (X) should have the same number of rows (32,155 in the subsample) as our dependent variable (y). y should only have one column as it is the outcome variable.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

#compare resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
Test set has 19.6% non-functioning water points and 80.4% functioning
Original train set has 19.1% non-functioning water points and 80.9% functioning
Resampled train set has 50.0% non-functioning water points and 50.0% functioning

We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note that this should be done on the train set only and not the test set, as we should not tinker with the latter.

X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)
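The scaling helper is defined in the shared functions script loaded at the top. A minimal sketch of what it is assumed to do, fitting the scaler on the (resampled) training set only and transforming both sets (scaling_sketch is an illustrative name):

from sklearn.preprocessing import StandardScaler

# Illustrative sketch only: the real scaling() lives in ta_01_packages_functions.py
def scaling_sketch(scaler, X_train, X_test):
    # fit on the training data only to avoid leaking test-set information
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled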

Running baseline model

start=time.time()

#instantiate and fit
KNN_base = KNeighborsClassifier().fit(X_train_res_scaled, y_train_res)

end=time.time()

time_fit_base=end-start

print(f"Time to fit the model on the training set is {round(time_fit_base,3)} seconds")
Time to fit the model on the training set is 0.003 seconds
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(KNN_base, X_train_res_scaled, y_train_res)

#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(KNN_base, X_train_res_scaled, y_train_res)
ROC AUC: 0.9691396173245337
PR AUC: 0.9708571411642523
              precision    recall  f1-score   support

           0       0.86      0.94      0.90     20823
           1       0.93      0.85      0.89     20823

    accuracy                           0.89     41646
   macro avg       0.90      0.89      0.89     41646
weighted avg       0.90      0.89      0.89     41646

Our first KNN has a relatively good accuracy score of 89% on the resampled training set. Errors mostly come from not recognising functioning water points, shown by the recall of 85% for the functioning class, compared to 94% for the non-functioning class.
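For reference, a minimal sketch of what a helper like get_scores is assumed to return, weighted-average scores matching the summary columns stored later (the real implementation lives in ta_01_packages_functions.py):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative sketch only: assumes weighted averaging across both classes
def get_scores_sketch(model, X, y):
    y_pred = model.predict(X)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred, average='weighted')
    recall = recall_score(y, y_pred, average='weighted')
    f1 = f1_score(y, y_pred, average='weighted')
    return accuracy, precision, recall, f1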

fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(KNN_base, X_test_scaled, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")

#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(KNN_base, X_test_scaled, y_test)
ROC AUC: 0.7129366832861221
PR AUC: 0.9133937334986582
              precision    recall  f1-score   support

           0       0.37      0.53      0.44      1258
           1       0.87      0.78      0.82      5173

    accuracy                           0.73      6431
   macro avg       0.62      0.66      0.63      6431
weighted avg       0.77      0.73      0.75      6431

Time to predict the outcome variable for the test set is 7.161 seconds

Our test set has an accuracy score of 73%. Similarly to the training set, the model misses quite a lot of functioning water points (mislabelling around a fifth of them) and struggles even more with non-functioning points (mislabelling almost half of them).
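To make these misclassification fractions explicit, we can draw the confusion matrix for the baseline test predictions; a hedged sketch (the notebook's print_report may already produce an equivalent plot):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# rows are true classes, columns are predicted classes
y_pred_test = KNN_base.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred_test)
ConfusionMatrixDisplay(cm, display_labels=['non-functioning', 'functioning']).plot()
plt.show()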

Narrowing down parameters

# set range of neighbors
k_range = [2, 5, 15, 50, 100, 500]
accuracy_scores = []

for k in k_range:

    #instantiate and fit
    KNN = KNeighborsClassifier(n_neighbors=k).fit(
        X_train_res_scaled, y_train_res)

    # store accuracy scores
    train_score = KNN.score(X_train_res_scaled, y_train_res)
    test_score = KNN.score(X_test_scaled, y_test)

    # append to list
    accuracy_scores.append(
        {'Neighbors': k, 'Train_score': train_score, 'Test_score': test_score})

# combine into a dataframe (DataFrame.append is deprecated in recent pandas)
accuracy_scores = pd.DataFrame(accuracy_scores)

# visualise relationship between neighbors and accuracy
plt.figure()

plt.plot(accuracy_scores['Neighbors'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Neighbors'],
         accuracy_scores['Test_score'], label='test score', marker='.')

plt.xlabel('Neighbors')
plt.ylabel("Accuracy")

plt.title("More neighbors decreases accuracy")
plt.legend(loc='best')
plt.grid()

plt.show()

As the number of neighbors we take into account increases, the accuracy of our model decreases. We need a middle ground where the gap between the train and test scores is not too large (the worst case being k=2) and the test accuracy is still relatively high (the worst case being k=500). It seems that this middle ground lies below 100 neighbors.

Finding optimal hyperparameters

We run a randomised cross-validation through a pipeline to find the optimal hyperparameters. We choose a randomised search as opposed to a grid search because KNN models are computationally expensive.

estimator = [('scaling', StandardScaler()),
             ('reduce_dim', PCA()),
             ('KNN', KNeighborsClassifier())]

# defining distribution of parameters we want to compare
param_dist = {"KNN__n_neighbors": range(1, 50, 1),
              'reduce_dim__n_components': [0.5, 0.6, 0.7, 0.8, 0.9, None]}

# run cross validation
pipeline_cross_val_random(estimator, param_dist, X_train_res, y_train_res, X_test, y_test)
The model with the best CV score has the following parameters: {'reduce_dim__n_components': 0.7, 'KNN__n_neighbors': 7}.
The best model has an accuracy score of 0.710464935468823 on the test set

The best model found by the search has a very similar accuracy score, but will hopefully score better on precision and recall. It seems that the optimal number of neighbors is 4, just below the default of 5 used by our baseline model. The optimal model also chooses to reduce dimensions (using PCA) to the minimum number of components that explain 60% of the variance in the dataset.
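For reference, a minimal sketch of what the pipeline_cross_val_random helper is assumed to do, wrapping the pipeline steps in RandomizedSearchCV; the number of iterations and folds here are assumptions, and the real helper lives in ta_01_packages_functions.py:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

# Illustrative sketch only
def pipeline_cross_val_random_sketch(estimator, param_dist, X_train, y_train, X_test, y_test):
    pipe = Pipeline(estimator)
    # a randomised search samples parameter combinations instead of trying them all
    search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                                n_iter=10, cv=5, n_jobs=-1, random_state=rand_seed)
    search.fit(X_train, y_train)
    print(f"The model with the best CV score has the following parameters: {search.best_params_}.")
    print(f"The best model has an accuracy score of {search.score(X_test, y_test)} on the test set")
    return search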

Running optimised model

X_train_res_scaled_PCA, X_test_scaled_PCA=run_PCA(0.6, X_train_res_scaled, X_test_scaled)
train shape: (41646, 7)
test shape: (6431, 7)

Note that the PCA transformer should be fitted on the training set only and then used to transform both the training and test sets. We end up with 7 components that explain 60% of the variance in our dataset.
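A minimal sketch of what a helper like run_PCA is assumed to do, fitting PCA on the scaled training set only (run_PCA_sketch is an illustrative name; the real helper lives in ta_01_packages_functions.py):

from sklearn.decomposition import PCA

# Illustrative sketch only
def run_PCA_sketch(variance_threshold, X_train, X_test):
    # a float n_components keeps the minimum number of components
    # whose cumulative explained variance exceeds the threshold
    pca = PCA(n_components=variance_threshold).fit(X_train)
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    print(f"train shape: {X_train_pca.shape}")
    print(f"test shape: {X_test_pca.shape}")
    return X_train_pca, X_test_pca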

start=time.time()

#instantiate and fit
KNN_opt = KNeighborsClassifier(n_neighbors=4).fit(X_train_res_scaled_PCA, y_train_res)

end=time.time()

time_fit_opt=end-start

print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
Time to fit the model on the training set is 0.035 seconds

The time to fit the model with our optimal parameters is roughly 10x longer than for the baseline model, although both are tiny: KNN is a lazy learner, so most of the work is deferred to prediction time.

fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(KNN_opt, X_train_res_scaled_PCA, y_train_res)

#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(KNN_opt, X_train_res_scaled_PCA, y_train_res)
ROC AUC: 0.9684302091261268
PR AUC: 0.970738276849189
              precision    recall  f1-score   support

           0       0.82      0.97      0.88     20823
           1       0.96      0.78      0.86     20823

    accuracy                           0.87     41646
   macro avg       0.89      0.87      0.87     41646
weighted avg       0.89      0.87      0.87     41646

With the PCA transformation we end up with many more false negatives and fewer false positives: the model seems to over-identify non-functioning water points.

fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(KNN_opt, X_test_scaled_PCA, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")

#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(KNN_opt, X_test_scaled_PCA, y_test)
ROC AUC: 0.6975742643178765
PR AUC: 0.9099437918289577
              precision    recall  f1-score   support

           0       0.33      0.61      0.43      1258
           1       0.88      0.70      0.78      5173

    accuracy                           0.68      6431
   macro avg       0.61      0.66      0.61      6431
weighted avg       0.77      0.68      0.71      6431

Time to predict the outcome variable for the test set is 0.138 seconds

On the whole, the optimised model actually performs worse than the baseline model on the test set: its accuracy score is around 5 percentage points lower and most metrics decrease, apart from the precision for functioning points and the recall for non-functioning points. However, that higher recall for non-functioning water points is the metric we are most interested in, so we go with the optimised model. In addition, it takes much less time to predict the outcome variable for the test set (around 0.1 seconds vs over 7 seconds).

Comparing results

plot_curve_roc('KNN', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
 tpr_test_base, roc_auc_test_base,  fpr_test_opt, tpr_test_opt, roc_auc_test_opt)

The baseline model has a slightly higher ROC AUC on the test set, which is the comparison we care the most about here. Even so, given the optimised model's better recall on non-functioning points and much faster prediction time, we keep the optimised model as our best KNN and export it below.
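For reference, a minimal sketch of how a plotting helper like plot_curve_roc might overlay the four ROC curves (train/test for the baseline and optimised models); the signature here is simplified and illustrative:

# Illustrative sketch only: the real plot_curve_roc lives in ta_01_packages_functions.py
def plot_curve_roc_sketch(model_name, curves):
    # curves: list of (label, fpr, tpr, auc) tuples
    plt.figure()
    for label, fpr, tpr, auc in curves:
        plt.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title(f"{model_name} ROC curves")
    plt.legend(loc='best')
    plt.show()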

Exporting

joblib.dump(KNN_opt, model_filepath+'k_nearest_neighbors_model.sav')
['/Users/thomasadler/Desktop/futuristic-platipus/models/k_nearest_neighbors_model.sav']
d = {'Model': ['K Nearest Neighbors'], 'Parameters': ['Neighbors=4, Standard Scaler'],
     'Accuracy Train': [accuracy_train_opt], 'Precision Train': [precision_train_opt],
     'Recall Train': [recall_train_opt], 'F1 Train': [f1_train_opt], 'ROC AUC Train': [roc_auc_train_opt],
     'Accuracy Test': [accuracy_test_opt], 'Precision Test': [precision_test_opt],
     'Recall Test': [recall_test_opt], 'F1 Test': [f1_test_opt], 'ROC AUC Test': [roc_auc_test_opt],
     'Time Fit': [time_fit_opt], 'Time Predict': [time_predict_test_opt],
     'Precision Non-functioning Test': 0.33, 'Recall Non-functioning Test': 0.61,
     'F1 Non-functioning Test': 0.43, 'Precision Functioning Test': 0.88,
     'Recall Functioning Test': 0.70, 'F1 Functioning Test': 0.78}

#to dataframe
best_model_result_df=pd.DataFrame(data=d)

#check
best_model_result_df
Model Parameters Accuracy Train Precision Train Recall Train F1 Train ROC AUC Train Accuracy Test Precision Test Recall Test F1 Test ROC AUC Test Time Fit Time Predict Precision Non-functioning Test Recall Non-functioning Test F1 Non-functioning Test Precision Functioning Test Recall Functioning Test F1 Functioning Test
0 K Nearest Neighbors Neighbors=4, Standard Scaler 0.873481 0.886708 0.873481 0.87239 0.96843 0.684808 0.77368 0.684808 0.713324 0.697574 0.035145 0.137945 0.33 0.61 0.43 0.88 0.70 0.78
best_model_result_df.to_csv(model_filepath + 'k_nearest_neighbors_model.csv')
metrics=[fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name=['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']

#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'k_nearest_neighbors_{metric_name}', metric)
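In the model-comparison notebook these artifacts can be loaded back; a brief illustrative example (np.save appends the .npy extension):

# Later, in the comparison notebook (illustrative)
knn_model = joblib.load(model_filepath + 'k_nearest_neighbors_model.sav')
fpr_test_opt = np.load(model_filepath + 'k_nearest_neighbors_fpr_test_opt.npy')
tpr_test_opt = np.load(model_filepath + 'k_nearest_neighbors_tpr_test_opt.npy')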