8. Decision Tree
- Decision Tree
- Preparing data
- Running baseline model
- Narrowing down parameters
- Finding optimal hyperparameters
- Running optimised model
- Comparing results
- Visualising feature importance
- Exporting
We run our third ML model, a decision tree: first with default parameters, then with tuned hyperparameters to try to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve, and we end by dumping our best model for later comparison.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X =modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']
#check
print(X.shape)
print(y.shape)
Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y). y should have only one column as it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note that this is done on the training set only, not the test set, which we should leave untouched.
Note that we do not scale our data, as a decision tree is not a distance-based ML model.
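As a quick check of that claim, here is a minimal sketch (the two comparison trees below are illustrative only, not part of the modelling pipeline) showing that a decision tree's predictions are unchanged when the features are rescaled, since each split only compares a feature value against a threshold:
#illustrative check: rescaling features does not change a tree's predictions
from sklearn.tree import DecisionTreeClassifier
import numpy as np
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=rand_seed).fit(X_train_res, y_train_res)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=rand_seed).fit(X_train_res * 1000, y_train_res)
#should print True (up to floating-point tie-breaking)
print(np.array_equal(tree_raw.predict(X_test), tree_scaled.predict(X_test * 1000)))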
start=time.time()
#instantiate and fit
DT_base = DecisionTreeClassifier(random_state=rand_seed).fit(X_train_res, y_train_res)
end=time.time()
time_fit_base=end-start
print(f"Time to fit the model on the training set is {round(time_fit_base,3)} seconds")
We can already see that fitting a decision tree on this dataset is potentially quite expensive.
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(DT_base, X_train_res, y_train_res)
#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(DT_base, X_train_res, y_train_res)
As expected, our training set has close to perfect accuracy metrics. An unconstrained decision tree keeps creating decision rules until it places all (or close to all) training observations in the correct classification bucket.
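To see how far the unconstrained baseline tree has grown, a quick sketch using sklearn's get_depth and get_n_leaves on the fitted model:
#inspect the size of the unconstrained baseline tree
print(f"Baseline tree depth: {DT_base.get_depth()}")
print(f"Baseline tree number of leaves: {DT_base.get_n_leaves()}")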
fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(DT_base, X_test, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")
#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(DT_base, X_test, y_test)
Our test set has an accuracy score of 77%. The precision and recall scores for non-functioning water points are especially low: the model got only 43% of its non-functioning predictions correct (precision), and it identified only 55% of all non-functioning points (recall).
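For reference, the per-class scores quoted above can be read from sklearn's classification_report; a minimal sketch (our print_report helper formats these differently, and we assume label 0 is non-functioning and 1 is functioning, as above):
from sklearn.metrics import classification_report
#per-class precision/recall for the baseline model on the test set
print(classification_report(y_test, DT_base.predict(X_test), target_names=['non-functioning', 'functioning']))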
We test whether the behaviour of our model changes based on the criterion it uses: gini (purity) or entropy (information gain). The criterion is the metric the model uses when deciding whether and where to split a node and create a decision rule/leaf. Overall, we see no significant difference between the two criteria. Sometimes the graph appears to show only one of them; that is because the two curves are superimposed and one hides the other, which is itself a good indication of how similar they are.
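To make the comparison concrete, here is a small sketch of the two impurity measures the splitter minimises, written for a binary node with class proportion p (illustrative only):
import numpy as np
def gini_impurity(p):
    #Gini impurity for a binary node: 1 - p^2 - (1-p)^2
    return 1 - p**2 - (1 - p)**2
def entropy_impurity(p):
    #entropy for a binary node, in bits; clip to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
#both are zero for pure nodes and largest at p=0.5
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"p={p}: gini={gini_impurity(p):.3f}, entropy={entropy_impurity(p):.3f}")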
# set range of min samples per leaf
m_sample_leaf = [2**i for i in range(1,8,1)]
#testing for two criteria, gini and entropy
accuracy_scores_gini = pd.DataFrame()
accuracy_scores_entropy = pd.DataFrame()
#for gini
for msf in m_sample_leaf:
#instantiate and fit
DT = DecisionTreeClassifier(min_samples_leaf=msf, criterion='gini', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_gini = accuracy_scores_gini.append(
{'Min sample leaf': msf, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
#for entropy
for msf in m_sample_leaf:
#instantiate and fit
DT = DecisionTreeClassifier(min_samples_leaf=msf, criterion='entropy', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_entropy = accuracy_scores_entropy.append(
{'Min sample leaf': msf, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
# visualise relationship between accuracy, min samples per leaf and criterion
plt.figure()
plt.plot(accuracy_scores_entropy['Min sample leaf'],
accuracy_scores_entropy['Train_score'], label='train score entropy', color='red', marker='.')
plt.plot(accuracy_scores_entropy['Min sample leaf'],
accuracy_scores_entropy['Test_score'], label='test score entropy', color='indianred', marker='.')
plt.plot(accuracy_scores_gini['Min sample leaf'],
accuracy_scores_gini['Train_score'], label='train score gini', color='royalblue', marker='.')
plt.plot(accuracy_scores_gini['Min sample leaf'],
accuracy_scores_gini['Test_score'], label='test score gini', color='navy', marker='.')
plt.xlabel('Minimum sample per leaf')
plt.ylabel("Accuracy")
plt.title("Higher minimum sample per leaf hurts accuracy")
plt.legend(loc='best')
plt.grid()
plt.show()
As the minimum number of samples per leaf increases, model accuracy decreases. The higher this number, the more restrictive the model is and the fewer leaves it can create, and thus the less finely it can sort each observation into its correct "bucket". This can help prevent overfitting, as the model is more general and does not perfectly match the training set.
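A quick sketch illustrates the mechanism: raising min_samples_leaf directly shrinks the number of leaves the tree is allowed to grow (the two values below are arbitrary and the exact counts depend on the data):
#compare tree size for a small versus large minimum leaf size
for msf in [2, 128]:
    DT_tmp = DecisionTreeClassifier(min_samples_leaf=msf, random_state=rand_seed).fit(X_train_res, y_train_res)
    print(f"min_samples_leaf={msf}: {DT_tmp.get_n_leaves()} leaves, depth {DT_tmp.get_depth()}")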
# set range of max depth
m_depth = [2**i for i in range(1, 8,1)]
#testing for two criteria, gini and entropy
accuracy_scores_gini = pd.DataFrame()
accuracy_scores_entropy = pd.DataFrame()
#for gini
for md in m_depth:
#instantiate and fit
DT = DecisionTreeClassifier(max_depth=md, criterion='gini', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_gini = accuracy_scores_gini.append(
{'Max depth': md, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
#for entropy
for md in m_depth:
#instantiate and fit
DT = DecisionTreeClassifier(max_depth=md, criterion='entropy', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_entropy = accuracy_scores_entropy.append(
{'Max depth': md, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
# visualise relationship between accuracy and max depth and criterion
plt.figure()
plt.plot(accuracy_scores_entropy['Max depth'],
accuracy_scores_entropy['Train_score'], label='train score entropy', color='red', marker='.')
plt.plot(accuracy_scores_entropy['Max depth'],
accuracy_scores_entropy['Test_score'], label='test score entropy', color='indianred', marker='.')
plt.plot(accuracy_scores_gini['Max depth'],
accuracy_scores_gini['Train_score'], label='train score gini', color='royalblue', marker='.')
plt.plot(accuracy_scores_gini['Max depth'],
accuracy_scores_gini['Test_score'], label='test score gini', color='navy', marker='.')
plt.xlabel('Maximum depth')
plt.ylabel("Accuracy")
plt.title("Max depth higher than 30 shows large overfitting")
plt.legend(loc='best')
plt.grid()
plt.show()
Once the maximum depth passes 32, the model overfits heavily, with a gap between train and test scores of close to 25 percentage points. As the maximum depth increases further, the scores plateau, probably because the tree never needs that much depth anyway. The region where accuracy is high and the train-test gap remains acceptable appears to be between 8 and 32.
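The train-test gap behind that conclusion can be computed directly from the DataFrames built above; a minimal sketch for the gini sweep:
#quantify overfitting as the gap between train and test accuracy at each depth
gap_df = accuracy_scores_gini.copy()
gap_df['Gap'] = gap_df['Train_score'] - gap_df['Test_score']
print(gap_df[['Max depth', 'Train_score', 'Test_score', 'Gap']])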
# set range of max features
m_features = [2**i for i in range(1,5,1)]
#testing for two criteria, gini and entropy
accuracy_scores_gini = pd.DataFrame()
accuracy_scores_entropy = pd.DataFrame()
#for gini
for mf in m_features:
#instantiate and fit
DT = DecisionTreeClassifier(max_features=mf, criterion='gini', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_gini = accuracy_scores_gini.append(
{'Max features': mf, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
#for entropy
for mf in m_features:
#instantiate and fit
DT = DecisionTreeClassifier(max_features=mf, criterion='entropy', random_state=rand_seed).fit(
X_train_res, y_train_res)
# store accuracy scores
train_score = DT.score(X_train_res, y_train_res)
test_score = DT.score(X_test, y_test)
# append to list
accuracy_scores_entropy = accuracy_scores_entropy.append(
{'Max features': mf, 'Train_score': train_score, 'Test_score': test_score}, ignore_index=True)
# visualise relationship between accuracy, max features and criterion
plt.figure()
plt.plot(accuracy_scores_entropy['Max features'],
accuracy_scores_entropy['Train_score'], label='train score entropy', color='red', marker='.')
plt.plot(accuracy_scores_entropy['Max features'],
accuracy_scores_entropy['Test_score'], label='test score entropy', color='indianred', marker='.')
plt.plot(accuracy_scores_gini['Max features'],
accuracy_scores_gini['Train_score'], label='train score gini', color='royalblue', marker='.')
plt.plot(accuracy_scores_gini['Max features'],
accuracy_scores_gini['Test_score'], label='test score gini', color='navy', marker='.')
plt.xlabel('Maximum number of features')
plt.ylabel("Accuracy")
plt.title("Setting a feature limit has no effect on accuracy")
plt.legend(loc='best')
plt.grid()
plt.show()
We see that max features has no effect on the accuracy scores of our model. We will refrain from using that parameter when tuning our model, to save computational power.
We run a randomised cross-validation through a pipeline to find the optimal hyperparameters. We choose a randomised search, as opposed to a grid search, because fitting many decision trees is expensive.
max_depth_range = range(8,40,1)
min_samples_leaf_range = range(16,64,1)
# setting up the pipeline steps (dimensionality reduction + decision tree) we want to search over
estimator = [('reduce_dim', PCA()),
('DT', DecisionTreeClassifier())]
# defining distribution of parameters we want to compare
param_dist = {"DT__criterion": ['gini', 'entropy'],
'DT__max_depth': max_depth_range,
'DT__min_samples_leaf': min_samples_leaf_range,
'reduce_dim__n_components': [0.5, 0.6, 0.7, 0.8, 0.9, None]}
# run cross validation
pipeline_cross_val_random(estimator, param_dist, X_train_res, y_train_res, X_test, y_test)
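pipeline_cross_val_random is one of our helper functions; purely as an illustration, here is a minimal sketch of the sklearn search it presumably wraps (the n_iter, cv and scoring settings below are assumptions, not necessarily what the helper uses):
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
#randomised search over the same pipeline steps and parameter distributions
pipe = Pipeline(estimator)
random_search = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=20, cv=5, n_jobs=-1, random_state=rand_seed)
random_search.fit(X_train_res, y_train_res)
print(random_search.best_params_)
print(f"Test accuracy of best pipeline: {random_search.score(X_test, y_test):.3f}")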
The best model has a max depth of 32 and a minimum of 35 samples per leaf. It also chose the 'entropy' criterion, although this probably matters little, as we saw above.
start=time.time()
#instantiate and fit
DT_opt = DecisionTreeClassifier(min_samples_leaf=25, max_depth=24, criterion='entropy').fit(X_train_res, y_train_res)
end=time.time()
time_fit_opt=end-start
print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
Fitting the optimised model takes about as long as fitting the baseline model.
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(DT_opt, X_train_res, y_train_res)
#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(DT_opt, X_train_res, y_train_res)
The various training-set accuracy metrics have all decreased (to around 83%) compared to the baseline model (all near 100%). This is welcome: the baseline model, with its near-perfect scores, was hugely overfitting the training set, and hyperparameter tuning has reined that in.
fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(DT_opt, X_test, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")
#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(DT_opt, X_test, y_test)
Although the accuracy score is slightly lower for our optimised model, it has much better recall scores. This means the model is getting better at recognising both functioning and non-functioning water points. For example, the false-positive rate has dropped by nearly 2 percentage points.
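The drop in false positives can be verified by comparing the raw confusion matrices (print_report already plots these; this is just the counts, with rows as true classes, 0 = non-functioning and 1 = functioning):
from sklearn.metrics import confusion_matrix
#raw confusion matrices for the baseline and optimised models on the test set
print("Baseline:\n", confusion_matrix(y_test, DT_base.predict(X_test)))
print("Optimised:\n", confusion_matrix(y_test, DT_opt.predict(X_test)))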
plot_curve_roc('DT', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
tpr_test_base, roc_auc_test_base, fpr_test_opt, tpr_test_opt, roc_auc_test_opt)
The optimised model has a lower AUC on the training set because it is prevented from overfitting. We care more about the test-set AUC, where the optimised model performs significantly better on unseen data. As a result, the optimised model is the better of the two.
coeff_bar_chart(DT_opt.feature_importances_, X.columns, t=False)
We see that whether a water point was installed after 2006 and how crucial it is are two very important features for the decision tree's splits. The latitude and longitude of the points are also very important, which is probably why the region dummy columns are given little weight.
Notably, the number of conflicts/violent events is not very important for our model.
Public management and water point complexity are also important drivers of our model's accuracy.
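The same importances can also be tabulated rather than plotted; a small sketch listing the ten most important features:
#rank features by the optimised tree's impurity-based importance
feature_importance = pd.Series(DT_opt.feature_importances_, index=X.columns).sort_values(ascending=False)
print(feature_importance.head(10))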
Image(dictionary_filepath+"6-Hypotheses.png")
joblib.dump(DT_opt, model_filepath+'decision_tree_model.sav')
d = {'Model':['Decision Tree'], 'Parameters':['Max depth=24, Min samples leaf=25, Criterion=entropy'], 'Accuracy Train': [accuracy_train_opt],\
'Precision Train': [precision_train_opt], 'Recall Train': [recall_train_opt], 'F1 Train': [f1_train_opt], 'ROC AUC Train':[roc_auc_train_opt],\
'Accuracy Test': [accuracy_test_opt], 'Precision Test': [precision_test_opt], 'Recall Test': [recall_test_opt], 'F1 Test': [f1_test_opt],\
'ROC AUC Test':[roc_auc_test_opt], 'Time Fit': [time_fit_opt],\
'Time Predict': [time_predict_test_opt], "Precision Non-functioning Test":[0.41], "Recall Non-functioning Test":[0.62],\
"F1 Non-functioning Test":[0.49], "Precision Functioning Test":[0.89], "Recall Functioning Test":[0.78], "F1 Functioning Test":[0.83]}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'decision_tree_model.csv')
metrics=[fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name=['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']
#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
np.save(model_filepath+f'decision_tree_{metric_name}', metric)
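In the model-comparison notebook, these arrays can then be reloaded with np.load; a minimal sketch (np.save appends a .npy extension to each filename):
#example of reloading the saved ROC inputs for cross-model comparison
fpr_test_opt_loaded = np.load(model_filepath + 'decision_tree_fpr_test_opt.npy')
tpr_test_opt_loaded = np.load(model_filepath + 'decision_tree_tpr_test_opt.npy')
print(fpr_test_opt_loaded.shape, tpr_test_opt_loaded.shape)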