9. Random Forest
- Random Forest
- Preparing data
- Running baseline model
- Narrowing down parameters
- Finding optimal hyperparameters
- Running optimised model
- Comparing results
- Visualising feature importance
- Exporting
We run our fourth ML model, a random forest: first with default parameters, then with tuned hyperparameters to try to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by saving our best model's results for later comparison.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X =modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']
#check
print(X.shape)
print(y.shape)
Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y). y should have only one column, as it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note that this should be done on the train set only, not the test set, as we should not tinker with the latter.
Note that we do not scale our data as random forest is not a distance-based ML model.
start=time.time()
#instantiate and fit
RF_base = RandomForestClassifier(random_state=rand_seed).fit(X_train_res, y_train_res)
end=time.time()
time_fit_base=end-start
print(f"Time to fit the model on the training set is {round(time_fit_base,3)} seconds")
We can already see that random forests are potentially very expensive to fit. They are essentially ensembles of many decision trees, which are already expensive on their own.
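To get a rough feel for that cost, a quick sketch is to time a single decision tree on the same resampled training data and compare it with the forest fit above; the exact numbers will depend on the machine.
# rough cost comparison: one decision tree vs the baseline forest (illustrative only)
from sklearn.tree import DecisionTreeClassifier

start = time.time()
DecisionTreeClassifier(random_state=rand_seed).fit(X_train_res, y_train_res)
single_tree_time = time.time() - start

print(f"Single tree: {round(single_tree_time, 3)}s vs baseline forest: {round(time_fit_base, 3)}s")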
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(RF_base, X_train_res, y_train_res)
#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(RF_base, X_train_res, y_train_res)
As expected, our training set has close-to-perfect accuracy metrics. Random forests are composed of multiple decision trees, which keep adding decision rules until all (or almost all) observations fall into the right classification bucket. For example, we end up with only 80 misclassified water points, representing less than 0.1% of the training dataset.
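As a sanity check on that misclassification count, we can look at the raw confusion matrix on the resampled training set; the off-diagonal cells are the misclassified points (a quick sketch):
# confusion matrix on the resampled training set; off-diagonal entries are misclassified points
from sklearn.metrics import confusion_matrix

cm_train = confusion_matrix(y_train_res, RF_base.predict(X_train_res))
print(cm_train)
print(f"Misclassified training points: {cm_train[0, 1] + cm_train[1, 0]}")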
fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(RF_base, X_test, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")
#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(RF_base, X_test, y_test)
Our test set has an accuracy score of 81%. This is a very good score for a first-run, baseline model. The precision and recall scores for non-functioning water points are quite low, though, while those for functioning points are close to 90%.
We saw in the decision tree model that the choice of criterion (Gini impurity or entropy/information gain) does not yield very different results. For this exploratory analysis we therefore do not differentiate between the two.
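If we wanted to double-check that this also holds for the forest, a minimal sketch of that comparison could look like the following; we do not tune the criterion further here.
# optional check that both criteria give similar test accuracy for a small forest
for crit in ['gini', 'entropy']:
    RF_crit = RandomForestClassifier(n_estimators=25, criterion=crit,
                                     random_state=rand_seed).fit(X_train_res, y_train_res)
    print(crit, round(RF_crit.score(X_test, y_test), 3))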
# set range of sample leaf
m_sample_leaf = [2**i for i in range(1,8,1)]
#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()
for msf in m_sample_leaf:
    #instantiate and fit
    RF = RandomForestClassifier(n_estimators=50, min_samples_leaf=msf, random_state=rand_seed).fit(
        X_train_res, y_train_res)
    # store accuracy scores
    train_score = RF.score(X_train_res, y_train_res)
    test_score = RF.score(X_test, y_test)
    # append results (pd.concat replaces the removed DataFrame.append)
    accuracy_scores = pd.concat([accuracy_scores, pd.DataFrame(
        [{'Min sample leaf': msf, 'Train_score': train_score, 'Test_score': test_score}])], ignore_index=True)
# visualise relationship between min sample leaf and accuracy
plt.figure()
plt.plot(accuracy_scores['Min sample leaf'],
accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Min sample leaf'],
accuracy_scores['Test_score'], label='test score', marker='.')
plt.xlabel('Minimum sample per leaf')
plt.ylabel("Accuracy")
plt.title("Increasing minimum samples per leaf decreases accuracy")
plt.legend(loc='best')
plt.grid()
plt.show()
As the minimum number of samples per leaf increases, model accuracy decreases. This is the same shape and trend as for the single decision tree. A higher minimum samples per leaf can, however, prevent overfitting, as the resulting trees are more general and do not perfectly match the training set.
# set range of depth max
m_depth = [2**i for i in range(1, 8,1)]
#empty dataframe to store accuracy scores
accuracy_scores = pd.DataFrame()
for md in m_depth:
    #instantiate and fit
    RF = RandomForestClassifier(max_depth=md, random_state=rand_seed).fit(
        X_train_res, y_train_res)
    # store accuracy scores
    train_score = RF.score(X_train_res, y_train_res)
    test_score = RF.score(X_test, y_test)
    # append results (pd.concat replaces the removed DataFrame.append)
    accuracy_scores = pd.concat([accuracy_scores, pd.DataFrame(
        [{'Max depth': md, 'Train_score': train_score, 'Test_score': test_score}])], ignore_index=True)
# visualise relationship between accuracy and max depth
plt.figure()
plt.plot(accuracy_scores['Max depth'],
accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Max depth'],
accuracy_scores['Test_score'], label='test score', marker='.')
plt.xlabel('Maximum depth')
plt.ylabel("Accuracy")
plt.title("Benefits of increasing max depth stop after 32")
plt.legend(loc='best')
plt.grid()
plt.show()
Once the maximum depth passes the 32 mark, the model starts overfitting: accuracy on the test set stops improving, so the model does not need a depth that large. This might make sense as we only have 32 features, so the trees are roughly making one decision rule per feature.
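One way to sanity-check this reading is to compare the number of features with the depths the unconstrained baseline trees actually reach (a quick sketch on the fitted baseline forest):
# depths reached by the unconstrained baseline trees, vs the number of features
tree_depths = [est.get_depth() for est in RF_base.estimators_]
print(f"Number of features: {X_train_res.shape[1]}")
print(f"Baseline tree depths: min {min(tree_depths)}, max {max(tree_depths)}")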
# set range of number of trees (estimators)
n_estimators_range = [2, 4, 16, 25, 50, 100]
#empty dataframe to store results
accuracy_scores = pd.DataFrame()
# loop over number of trees
for n_est in n_estimators_range:
    #instantiate and fit
    RF = RandomForestClassifier(n_estimators=n_est, random_state=rand_seed).fit(
        X_train_res, y_train_res)
    # store accuracy scores
    train_score = RF.score(X_train_res, y_train_res)
    test_score = RF.score(X_test, y_test)
    # append results (pd.concat replaces the removed DataFrame.append)
    accuracy_scores = pd.concat([accuracy_scores, pd.DataFrame(
        [{'Number trees': n_est, 'Train_score': train_score, 'Test_score': test_score}])], ignore_index=True)
# visualise relationship between number of trees and accuracy scores
plt.figure()
plt.plot(accuracy_scores['Number trees'],
accuracy_scores['Train_score'], label='train score', marker='.')
plt.plot(accuracy_scores['Number trees'],
accuracy_scores['Test_score'], label='test score', marker='.')
plt.xlabel('Number of trees')
plt.ylabel("Accuracy")
plt.title("More than 25 trees does not increase accuracy")
plt.legend(loc='best')
plt.grid()
plt.show()
A random forest is an ensemble of decision trees. Each tree labels every observation, and the forest then averages/takes a vote across the trees to make its final classification decision. Here we vary the number of decision trees in the forest. After about 25 trees, the accuracy score stagnates at around 82% on the test set.
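In scikit-learn this "vote" is implemented as an average of the per-tree class probabilities; a small sketch on a handful of test rows makes that concrete, using the baseline forest fitted above.
# the forest's probability is the mean of the per-tree probabilities (up to floating point)
sample = X_test.values[:5]  # a few test rows, passed as a plain array for the base trees
avg_tree_proba = np.mean([tree.predict_proba(sample) for tree in RF_base.estimators_], axis=0)
print(np.round(avg_tree_proba, 3))
print(np.round(RF_base.predict_proba(sample), 3))  # should match the averaged probabilities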
We run a randomised cross-validation through a pipeline to find the optimal hyperparameters. We choose a randomised search rather than a grid search because decision trees and random forests are expensive to fit.
max_depth_range = range(1, 40)  # max_depth must be at least 1
min_samples_leaf_range = range(1,50,1)
# setting up which models/scalers we want to grid search
estimator = [('reduce_dim', PCA()), ('RF', RandomForestClassifier(n_estimators=25, random_state=rand_seed))]
# defining distribution of parameters we want to compare
param_dist = {"RF__criterion": ['gini', 'entropy'],
'RF__max_depth': max_depth_range,
'RF__min_samples_leaf': min_samples_leaf_range,
'reduce_dim__n_components': [0.5, 0.6, 0.7, 0.8, 0.9, None]}
# run cross validation
pipeline_cross_val_random(estimator, param_dist, X_train_res, y_train_res, X_test, y_test)
The best model has a max depth of 31 (vs 32 for the previous individual decision tree model) and minimum samples per leaf of 11, similar to our previous decision tree (9). It also chose the 'entropy' criterion, just like the individual decision tree.
start=time.time()
#instantiate and fit
RF_opt = RandomForestClassifier(bootstrap=True, oob_score=True, min_samples_leaf=11, max_depth=31, n_estimators=100, criterion='entropy', random_state=rand_seed).fit(X_train_res, y_train_res)
end=time.time()
time_fit_opt=end-start
print(f"Time to fit the model on the training set is {round(time_fit_opt, 3)} seconds")
The time to fit the model is shorter than for the baseline model. We keep bootstrap=True and set oob_score=True, which gives an out-of-bag estimate of accuracy; we also look at the importance of individual features later on.
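Since oob_score=True, the out-of-bag accuracy estimate is available directly on the fitted model and gives a quick built-in check on generalisation (the exact value depends on the run):
# out-of-bag accuracy estimate from the bootstrap samples each tree did not see
print(f"OOB accuracy estimate: {round(RF_opt.oob_score_, 3)}")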
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(RF_opt, X_train_res, y_train_res)
#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(RF_opt, X_train_res, y_train_res)
The various accuracy metrics for the training set have all decreased (down to around 94%) compared to the baseline model (all at 100%). This is a good sign because the baseline model was hugely overfitting the training set, while the optimised model hopefully is not.
fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(RF_opt, X_test, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")
#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(RF_opt, X_test, y_test)
The accuracy score is slightly lower for our optimised model. Its precision for non-functioning points has worsened while its recall has improved. This means it is getting better at identifying non-functioning water points, at the expense of labelling more functioning water points as non-functioning. As a result, we go with the optimised model, since the recall score for non-functioning water points is what we care most about.
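Because recall on non-functioning points (label 0 in is_functioning) is the metric we care most about, it is worth computing it explicitly rather than only reading it off the report; a minimal sketch:
# recall and precision for the non-functioning class (label 0) on the test set
from sklearn.metrics import recall_score, precision_score

y_pred_opt = RF_opt.predict(X_test)
print(f"Recall (non-functioning): {round(recall_score(y_test, y_pred_opt, pos_label=0), 2)}")
print(f"Precision (non-functioning): {round(precision_score(y_test, y_pred_opt, pos_label=0), 2)}")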
plot_curve_roc('RF', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt, fpr_test_base,
tpr_test_base, roc_auc_test_base, fpr_test_opt, tpr_test_opt, roc_auc_test_opt)
The optimised model has a worse AUC on the train set because it is being prevented from overfitting. We are more concerned with the test set AUC. Here the optimised model performs the same as the baseline model.
coeff_bar_chart(RF_opt.feature_importances_, X.columns, t=False)
We see that a water point being crucial to its community and the proportion of the community it serves are two very important features for the trees' splits. The latitude and longitude of the points are also very important, which is probably why the region dummy columns are not given much importance. Notably, the number of conflicts/violent events is not very important for our model.
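To read the exact ranking rather than eyeballing the bar chart, the impurity-based importances can also be listed directly (a small sketch using the fitted optimised forest):
# ten most important features according to impurity-based importance
importances = pd.Series(RF_opt.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))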
Image(dictionary_filepath+"6-Hypotheses.png")
#joblib.dump(RF_opt, model_filepath+'random_forest_model.sav')
d = {'Model': ['Random Forest'],
     'Parameters': ['Max depth=31, Min samples leaf=11, Criterion=entropy, Number trees=100'],
     'Accuracy Train': [accuracy_train_opt], 'Precision Train': [precision_train_opt],
     'Recall Train': [recall_train_opt], 'F1 Train': [f1_train_opt], 'ROC AUC Train': [roc_auc_train_opt],
     'Accuracy Test': [accuracy_test_opt], 'Precision Test': [precision_test_opt],
     'Recall Test': [recall_test_opt], 'F1 Test': [f1_test_opt], 'ROC AUC Test': [roc_auc_test_opt],
     'Time Fit': [time_fit_opt], 'Time Predict': [time_predict_test_opt],
     'Precision Non-functioning Test': [0.48], 'Recall Non-functioning Test': [0.62],
     'F1 Non-functioning Test': [0.54], 'Precision Functioning Test': [0.90],
     'Recall Functioning Test': [0.83], 'F1 Functioning Test': [0.87]}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'random_forest_model.csv')
metrics=[fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name=['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']
#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'random_forest_{metric_name}', metric)
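In the comparison notebook these arrays can then be reloaded; note that np.save appends the .npy extension, so a load would look roughly like this (sketch, assuming the same model_filepath):
# example of reloading one of the saved arrays later (np.save adds the .npy extension)
fpr_test_opt_loaded = np.load(model_filepath + 'random_forest_fpr_test_opt.npy')
print(fpr_test_opt_loaded.shape)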