We run our eigth ML model, an xgboost, first with default parameters, then we attempt to tune hyperparameters to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for further comparison.

XGBoost

%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
/Users/thomasadler/opt/anaconda3/lib/python3.9/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index

Preparing data

modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)

#check
modelling_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 107184 entries, 0 to 108905
Data columns (total 32 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   lat_deg                   107184 non-null  float64
 1   lon_deg                   107184 non-null  float64
 2   is_functioning            107184 non-null  int64  
 3   distance_to_primary       107184 non-null  float64
 4   distance_to_secondary     107184 non-null  float64
 5   distance_to_tertiary      107184 non-null  float64
 6   distance_to_city          107184 non-null  float64
 7   distance_to_town          107184 non-null  float64
 8   usage_cap                 107184 non-null  float64
 9   is_complex_tech           107184 non-null  int64  
 10  is_installed_after_2006   107184 non-null  int64  
 11  is_public_management      107184 non-null  int64  
 12  crucialness               107184 non-null  float64
 13  perc_hh_head_male         107184 non-null  float64
 14  perc_pop1318_secondary    107184 non-null  float64
 15  perc_pop017_certificate   107184 non-null  float64
 16  perc_pop017_both_parents  107184 non-null  float64
 17  perc_pop2p_disability     107184 non-null  float64
 18  perc_pop1017_married      107184 non-null  float64
 19  perc_pop1217_birth        107184 non-null  float64
 20  perc_pop1464_working      107184 non-null  float64
 21  perc_hh_temp_dwelling     107184 non-null  float64
 22  perc_hh_mosquito_net      107184 non-null  float64
 23  perc_hh_toilet            107184 non-null  float64
 24  perc_hh_own_house         107184 non-null  float64
 25  perc_hh_bank_acc          107184 non-null  float64
 26  perc_hh_electricity       107184 non-null  float64
 27  total_events_adm4         107184 non-null  float64
 28  perc_local_served         107184 non-null  float64
 29  is_central                107184 non-null  int64  
 30  is_eastern                107184 non-null  int64  
 31  is_western                107184 non-null  int64  
dtypes: float64(25), int64(7)
memory usage: 27.0 MB
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X =modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']

#check
print(X.shape)
print(y.shape)
(107184, 31)
(107184,)

Our independent variable (X) should have the same number of rows (107,184) than our dependent variable (y). y should only have one column as it is the outcome variable.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

#compre resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
Test set has 19.7% non-functioning water points and 80.3% functioning
Original train set has 19.5% non-functioning water points and 80.5% functioning
Resampled train set has 50.0% non-functioning water points and 50.0% functioning

We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note this should be done on the train set and not the test set as we should not tinker with the latter.

Note that we do not scale our data because XGBoost is not a distance-based model and thus does not need scaled data.

Running baseline model

start=time.time()

#instantiate and fit
XG_base = XGBClassifier(random_state=rand_seed, verbose=True).fit(X_train_res, y_train_res)

end=time.time()

time_fit_base=end-start

print(f"Time to fit the model on the training set is {round(time_fit_base,3)} seconds")
[10:19:48] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:576: 
Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[10:19:48] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Time to fit the model on the training set is 24.288 seconds

The XGBoost classifier is not from the sklearn library. It uses multiple regression trees to learn decision rules that are the most accurate, while preventing overfitting.

fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(XG_base, X_train_res, y_train_res)

#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(XG_base, X_train_res, y_train_res)
ROC AUC: 0.9243921709239664
PR AUC: 0.9280459264190174
              precision    recall  f1-score   support

           0       0.83      0.86      0.85     68984
           1       0.86      0.82      0.84     68984

    accuracy                           0.84    137968
   macro avg       0.84      0.84      0.84    137968
weighted avg       0.84      0.84      0.84    137968

Our training set has an accuracy score of 84%. It is expected to perform well on the training set.

fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(XG_base, X_test, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")

#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(XG_base, X_test, y_test)
ROC AUC: 0.8087662003643492
PR AUC: 0.9404674571728564
              precision    recall  f1-score   support

           0       0.45      0.65      0.53      4221
           1       0.90      0.80      0.85     17216

    accuracy                           0.77     21437
   macro avg       0.67      0.73      0.69     21437
weighted avg       0.81      0.77      0.79     21437

Time to predict the outcome variable for the test set is 0.05 seconds

Our test set also has an accuracy score of 77%. Similarly to our other past models, it has a high f1 score of 0.85 (harmonic mean of precision and recall) for functioning water points but a low f1 score of 0.53 for non functioning points.

Narrowing down parameters

# set range of learning rate
learning_rate_range = np.arange(0.1, 5, 0.5)

#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()

for lr in learning_rate_range:

    #instantiate and fit
    XG = XGBClassifier(learning_rate=lr, random_state=rand_seed).fit(
        X_train_res, y_train_res)

    # store accuracy scores
    train_score = XG.score(X_train_res, y_train_res)
    test_score = XG.score(X_test, y_test)

    # append to list
    accuracy_scores = accuracy_scores.append(
        {'Learning rate': lr, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)

# visualise relationship between learning rate and accuracy
plt.figure()

plt.plot(accuracy_scores['Learning rate'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Learning rate'],
         accuracy_scores['Test_score'], label='test score', marker='.')

plt.xlabel('Learning rate')
plt.ylabel("Accuracy")

plt.title("Learning rate above 1.5 is counterproductive")
plt.legend(loc='best')
plt.grid()

plt.show()
[10:20:14] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:20:43] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:21:05] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:21:30] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:21:56] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:02] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:06] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:10] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:14] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:18] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

The learning rate tells the model how quickly it sould adapt to its errors. If the learning rate is high, it will make larger jumps after making an error. The trend is the same as with AdaBoost: a learning rate of more than 1.5 becomes counterproductive as the accuracy scores fall and usually lead to overfitting on the training set.

# set range of depth max
m_depth = [2**i for i in range(1, 8, 2)]

#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()

for md in m_depth:

    #instantiate and fit
    XG = XGBClassifier(max_depth=md, random_state=rand_seed).fit(
        X_train_res, y_train_res)

    # store accuracy scores
    train_score = XG.score(X_train_res, y_train_res)
    test_score = XG.score(X_test, y_test)

    # append to list
    accuracy_scores = accuracy_scores.append(
        {'Max depth': md, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)

# visualise relationship between accuracy and max depth
plt.figure()

plt.plot(accuracy_scores['Max depth'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Max depth'],
         accuracy_scores['Test_score'], label='test score', marker='.')


plt.xlabel('Maximum depth')
plt.ylabel("Accuracy")

plt.title("Max depth around 32 is ideal")
plt.legend(loc='best')
plt.grid()

plt.show()
[10:22:21] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:22:30] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:23:00] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:25:00] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

MAximum depth is the maximum splits that each tree computed in XGBoost can have. At some point (after around 32) the marginal return to a higher max depth is null, probably because these trees do not need any more splits to make a decision/classification. It is also linked to the fact that we only 32 features.

# set range of max leaves
max_leaf = [2**i for i in range(1,8,1)]

#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()

for ml in max_leaf:

    #instantiate and fit
    XG = XGBClassifier(max_leaves=ml, random_state=rand_seed).fit(
        X_train_res, y_train_res)

    # store accuracy scores
    train_score = XG.score(X_train_res, y_train_res)
    test_score = XG.score(X_test, y_test)

    # append to list
    accuracy_scores = accuracy_scores.append(
        {'Max leaf': ml, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)

# visualise relationship between max leaves and accuracy
plt.figure()

plt.plot(accuracy_scores['Max leaf'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Max leaf'],
         accuracy_scores['Test_score'], label='test score', marker='.')

plt.xlabel('Maximum number of leaves')
plt.ylabel("Accuracy")

plt.title("Max leaves does not affect accuracy")
plt.legend(loc='best')
plt.grid()

plt.show()
[10:27:41] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:28:00] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:28:20] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:28:40] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:29:04] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:29:24] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:29:44] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

It seems that choosing the maximum number of leaves a split can have is not useful. This makes sense as we only have 2 classes for our outcome variable.

# set range of gamma
gamma_range = np.arange(0.1, 1, 0.1)

#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()

for gam in gamma_range:

    #instantiate and fit
    XG = XGBClassifier(gamma=gam, random_state=rand_seed).fit(
        X_train_res, y_train_res)

    # store accuracy scores
    train_score = XG.score(X_train_res, y_train_res)
    test_score = XG.score(X_test, y_test)

    # append to list
    accuracy_scores = accuracy_scores.append(
        {'Gamma': gam, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)

# visualise relationship between gamma and accuracy
plt.figure()

plt.plot(accuracy_scores['Gamma'],
         accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Gamma'],
         accuracy_scores['Test_score'], label='test score', marker='.')

plt.xlabel('Gamma')
plt.ylabel("Accuracy")

plt.title("Gamma regularisation has minimal impact on accuracy")
plt.legend(loc='best')
plt.grid()

plt.show()
[10:30:04] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:29] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:30:55] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:31:20] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:31:48] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:32:13] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:32:37] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:33:04] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:33:26] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

Gamma is a regularisation parameter. It attempts to prevent overfitting by only letting trees split when the associated gain with that split is larger than gamma. Here it doesn't seem to have a very obvious impact on our accuracy. We will use gamma in our grid search to be sure of it's ideal value.

Finding optimal hyperparameters

We run a randomised cross validation through a pipeline to find the optimal hyperparameters. We choose a randomised as opposed to a grid search because adaboost models are very expensive. For the same reasons mentioned above, we stick to the default number of trees, 50.

We choose not test for PCA (dimensionality reduction) because the majority of our models (including AdaBoost) did not choose to reduce dimensions. The AdaBoost model had to run for 20minutes, so we will avoid this by not testing for PCA as we do not have any confidence that it will be useful for XGBoost.

learning_rate_range = np.arange(0.1, 1.5, 0.2)
m_depth = range(8,24,1)
gamma_range = np.arange(0.1,1,0.1)

# setting up which models/scalers we want to grid search
estimator = [('XG', XGBClassifier(n_estimators=10, random_state=rand_seed))]

# defining distribution of parameters we want to compare
param_distrib = {'XG__learning_rate': learning_rate_range,
'XG__max_depth': m_depth,
'XG__gamma': gamma_range}

# run cross validation
pipeline_cross_val_random(estimator, param_distrib, X_train_res, y_train_res, X_test, y_test)
[10:33:50] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:33:56] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:03] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:09] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:15] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:22] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:29] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:37] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:45] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:34:54] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:02] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:07] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:13] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:18] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:22] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:28] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:36] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:45] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:35:56] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:07] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:18] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:26] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:34] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:41] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:49] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:36:56] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:01] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:06] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:11] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:17] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:22] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:26] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:31] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:35] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:38] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:42] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:47] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:50] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:54] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:37:57] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:01] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:07] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:14] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:21] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:27] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:33] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:38] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:42] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:47] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:51] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[10:38:56] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
The model with the best CV score has the following parameters: {'XG__max_depth': 21, 'XG__learning_rate': 0.30000000000000004, 'XG__gamma': 0.9}.
The best model has an accuracy score of 0.7992722862340813 on the test set

The best model has a learning rate of 0.9, lower than the 1.3 rate found for AdaBoost. The max depth is 16 (half the number of our features) and a gamma of 0.3, this is a relatively low regularisation value.

Running optimised model

start=time.time()

#instantiate and fit
XG_opt = XGBClassifier(learning_rate=0.9, max_depth=16, gamma=0.3, random_state=rand_seed).fit(X_train_res, y_train_res)

end=time.time()

time_fit_opt=end-start

print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
[10:39:07] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1645117948562/work/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Time to fit the model on the training set is 67.311 seconds

The time to fit the model is similar to the baseline model. The only thing we really changed here is to increase the learning rate from 1 to 1.3.

fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(XG_opt, X_train_res, y_train_res)

#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(XG_opt, X_train_res, y_train_res)
ROC AUC: 0.9999925977016636
PR AUC: 0.9999925723646281
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     68984
           1       1.00      1.00      1.00     68984

    accuracy                           1.00    137968
   macro avg       1.00      1.00      1.00    137968
weighted avg       1.00      1.00      1.00    137968

Running our XGBoost with optimal hyperparameters pushes our accuracy score for our training set to 1. It is (nearly) perfect and might suggest some overfitting, even though we tried to avoid that with the gamma and max depth parameters.

fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(XG_opt, X_test, y_test)

print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")

#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(XG_opt, X_test, y_test)
ROC AUC: 0.8075630350856797
PR AUC: 0.9378410440342522
              precision    recall  f1-score   support

           0       0.51      0.56      0.54      4221
           1       0.89      0.87      0.88     17216

    accuracy                           0.81     21437
   macro avg       0.70      0.71      0.71     21437
weighted avg       0.82      0.81      0.81     21437

Time to predict the outcome variable for the test set is 0.075 seconds

The test set with our optimised model performs much better. It makes significant improvements in the recall of functioning water points (it's identifying a larger proportion of all functioning points) in the precision of non-functioning points (it's non-functioning labels are more often correct). However, the recall of non-functioning points does worsen (it is identifying a lower proportion of all non-functioning points).

Comparing results

plot_curve_roc('XG', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
 tpr_test_base, roc_auc_test_base,  fpr_test_opt, tpr_test_opt, roc_auc_test_opt)

The optimised and baseline model have near identical AUCs. We go with the optimised model because its overall acuracy metrics are better.

Visualising feature importance

coeff_bar_chart(XG_base.feature_importances_, X.columns, t=False)

Again, we find that the number of conflicts/violent events becomes an important feature in our model. Installation year and region is also important. It seems that distance from school and urban areas are the weakest feature in explaining the functionality of a water point.

explainer = shap.TreeExplainer(XG_base)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test, plot_size=(20,20))

We see that the strenght of the coefficient for crucialness is heavily impacted by certain data points being present in the dataset. This is also true for points installed after 2006 and for non-publicly managed water points, for their own respective coefficients.

for col in ['crucialness', 'perc_local_served', 'distance_to_tertiary']:
    shap.dependence_plot(col, shap_values, X_test)

We see that a curcialness score between 0 and 0.2 has a strong effect on the importance of crucialness in the model. After that point, the crucialness score is not as important. We do not see a very obvious trend for perc_local_served and distance_to_tertiary.

shap.initjs()

# plot SHAP values for first observation
shap.force_plot(explainer.expected_value, shap_values[0], features=X_test.iloc[0], feature_names=X_test.columns)
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

The average predicted water point functionality for the model is 0.005 (this can translate to 50% since it is scaled). For this observation, the predicted functionality is -0.91 (translated to a probability of around 30%).

We see that this water point has a low crucialness value, heavily decreasing the probability of this water point functioning. Similarly, the fact that the water point is installed after 2006 and has a usage capacity of 100 negatively affects this probability. On the other hand, the fact that the point serves a third of the local population increased the probability of that water point functioning.

shap_df = pd.DataFrame(shap_values, columns=X_test.columns)
plt.subplots(8,4, figsize=(16,36))

for i, col in enumerate(shap_df.columns, 1):
  plt.subplot(8,4, i)

  #choose low alpha for transparency
  plt.scatter(X_test[col], shap_df[col], alpha=0.1)

  plt.xlabel(col)
  plt.ylabel(f'SHAP value')
  
plt.tight_layout()
plt.show()
Image(dictionary_filepath+"6-Hypotheses.png")

Exporting

joblib.dump(XG_base, model_filepath+'xgboost_model.sav')
['/Users/thomasadler/Desktop/futuristic-platipus/models/xgboost_model.sav']
d = {'Model':['XGBoost'], 'Parameters':['Max depth=6, Gamma=0, Learning rate=0.3, Number of trees=50'], 'Accuracy Train': [accuracy_train_base],\
    'Precision Train': [precision_train_base], 'Recall Train': [recall_train_base], 'F1 Train': [f1_train_base], 'ROC AUC Train':[roc_auc_train_base],\
        'Accuracy Test': accuracy_test_base, 'Precision Test': [precision_test_base], 'Recall Test': [recall_test_base], 'F1 Test': [f1_test_base],\
            'ROC AUC Test':[roc_auc_test_base],'Time Fit': time_fit_base,\
                'Time Predict': time_predict_test_base,   "Precision Non-functioning Test":0.45, "Recall Non-functioning Test":0.65,\
                                "F1 Non-functioning Test":0.53, "Precision Functioning Test":0.90, "Recall Functioning Test":0.80,"F1 Functioning Test":0.85}

#to dataframe
best_model_result_df=pd.DataFrame(data=d)

#check
best_model_result_df
Model Parameters Accuracy Train Precision Train Recall Train F1 Train ROC AUC Train Accuracy Test Precision Test Recall Test F1 Test ROC AUC Test Time Fit Time Predict Precision Non-functioning Test Recall Non-functioning Test F1 Non-functioning Test Precision Functioning Test Recall Functioning Test F1 Functioning Test
0 XGBoost Max depth=6, Gamma=0, Learning rate=0.3, Numbe... 0.84184 0.842494 0.84184 0.841765 0.924392 0.771517 0.813022 0.771517 0.786053 0.808766 19.921147 0.038439 0.45 0.65 0.53 0.9 0.8 0.85
best_model_result_df.to_csv(model_filepath + 'xgboost_model.csv')
metrics=[fpr_train_base, tpr_train_base, fpr_test_base, tpr_test_base]
metrics_name=['fpr_train_base', 'tpr_train_base', 'fpr_test_base', 'tpr_test_base']

#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'xgboost_{metric_name}', metric)