6. Logistic Regression
- Aim
- Random Guess
- Logistic Regression
- Preparing data
- Running baseline model
- Upsampling non-functioning water points
- Narrowing down parameters
- PCA Analysis
- Finding optimal hyperparameters
- Running optimised model
- Comparing results
- Visualising feature importance
- Exporting
We run our first ML model, a logistic regression: first with default parameters, then with tuned hyperparameters to try to improve it. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for later comparison.
The aim of our model is, of course, to achieve a high accuracy score. However, what matters most is that we do not miss non-functioning water points. We would rather have more false negatives (a point labelled as non-functioning that is actually fine) than false positives (a point labelled as functioning that is actually not). This is captured by the recall score for the 0/negative (non-functioning) class, which we want to be as high as possible.
The intuition is that we would rather send an engineer to a functioning water point by mistake than fail to repair a non-functioning one because we missed it. Both are costly and not ideal, but the cost of a whole community being cut off from water is worse than paying an engineer for a repair that did not need to be made.
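Concretely, the metric we will watch is scikit-learn's recall for the 0 (non-functioning) class; a small toy example of how it is computed:
from sklearn.metrics import recall_score, classification_report
# Toy example: recall for the non-functioning (0) class answers "of all truly
# non-functioning points, how many did we flag?"
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]
print(recall_score(y_true, y_pred, pos_label=0))   # 0.5 -> we missed half the broken points
print(classification_report(y_true, y_pred))       # per-class precision/recall/f1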
Our very first "model", or benchmark, is the accuracy score we would get if we guessed blindly. Since our outcome variable is binary, randomly predicting each observation would give, on average, an accuracy score of 50%. Any model close to this blind guess would probably be close to useless (if not worse!).
This 50% accuracy score would apply if we had no prior knowledge about the distribution of our outcome variable. However, since we do know that distribution (in this case, 80% of observations are 1s), we could simply predict the majority class every time and reach an accuracy score of 80% without any kind of computation.
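As a toy illustration of this "no-computation" benchmark, a majority-class dummy classifier scores exactly the prevalence of the majority class:
from sklearn.dummy import DummyClassifier
import numpy as np
# Toy 80/20 outcome: always predicting the majority class scores exactly 80%
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.zeros((100, 1))          # features are irrelevant for this strategy
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print(dummy.score(X_toy, y_toy))    # 0.8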
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X = modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']
#check
print(X.shape)
print(y.shape)
Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y), and y should be a single column since it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=rand_seed)
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
The skewed distribution of our outcome variable increases the risk that the model learns to recognise and label functioning water points much more readily than non-functioning ones. This may be an issue we have to resolve later on.
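One optional safeguard, not used in this notebook, is to pass stratify=y to train_test_split so that both sets keep exactly the same 80/20 class ratio; a sketch:
# Not used here: a stratified split keeps the class ratio identical in train and test sets
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=rand_seed, stratify=y)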
start=time.time()
#instantiate and fit
LR_base = LogisticRegression(random_state=rand_seed).fit(X_train, y_train)
end=time.time()
time_fit_base=end-start
print(f"Time to fit the model on the training set is {round(time_fit_base, 3)} seconds")
fpr_train_base, tpr_train_base, roc_auc_train_base, precision_train_base_plot, recall_train_base_plot, pr_auc_train_base, time_predict_train_base = print_report(LR_base, X_train, y_train)
#storing accuracy scores
accuracy_train_base, precision_train_base, recall_train_base, f1_train_base = get_scores(LR_base, X_train, y_train)
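print_report and get_scores are project helpers from ta_01_packages_functions.py; as an assumed sketch (not the actual implementation), get_scores presumably wraps the standard scikit-learn metrics along these lines:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def get_scores_sketch(model, X, y):
    # assumed behaviour of the project's get_scores helper, for illustration only
    y_pred = model.predict(X)
    return (accuracy_score(y, y_pred), precision_score(y, y_pred),
            recall_score(y, y_pred), f1_score(y, y_pred))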
Our basic logistic regression has a relatively good accuracy score of 80% on the training set. However, this is only because the model is currently labelling nearly all water points as functioning: roughly 20% of all predictions are false positives (non-functioning points labelled as functioning). As a result, the recall and f1 scores for non-functioning water points drop to 0.
fpr_test_base, tpr_test_base, roc_auc_test_base, precision_test_base_plot, recall_test_base_plot, pr_auc_test_base, time_predict_test_base = print_report(LR_base, X_test, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_base,3)} seconds")
#storing accuracy scores
accuracy_test_base, precision_test_base, recall_test_base, f1_test_base = get_scores(LR_base, X_test, y_test)
As with the training set, the model's predictions for the test set are pretty useless: we could label every point as functioning and end up with the same result without doing a single calculation. As mentioned above, this is due to our sample being unbalanced, with 80% of water point observations functioning. As a result, the model never learns any patterns for non-functioning water points.
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare resampled dataset
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class (non-functioning water points) to get an equal distribution of our outcome variable. Note that this should be done on the train set only, never the test set, which must stay untouched so that evaluation remains unbiased.
# set range of penalties (sklearn's C is the inverse regularisation strength: smaller C = stronger penalty)
c_range = [10**i for i in np.arange(-7, 7, 0.5)]

rows = []
for c in c_range:
    # instantiate and fit
    LR = LogisticRegression(C=c, random_state=rand_seed).fit(
        X_train_res, y_train_res)
    # store accuracy scores for this penalty
    rows.append({'Penalty': c,
                 'Train_score': LR.score(X_train_res, y_train_res),
                 'Test_score': LR.score(X_test, y_test)})

# collect into a dataframe (DataFrame.append was removed in pandas 2.0)
accuracy_scores = pd.DataFrame(rows)
# visualise relationship between penalty and accuracy
plt.figure()
plt.plot(accuracy_scores['Penalty'], accuracy_scores['Train_score'],
         label='train score', marker='.')
plt.plot(accuracy_scores['Penalty'], accuracy_scores['Test_score'],
         label='test score', marker='.')
plt.xscale('log')
plt.xlabel('Penalty')
plt.ylabel('Accuracy')
plt.title('Penalty effect on train and test accuracy scores is unclear')
plt.legend(loc='best')
plt.grid()
plt.show()
There is no very clear trend between the accuracy scores on the train and test sets and the magnitude of the penalty (note that scikit-learn's C is the inverse of the regularisation strength, so smaller values of C mean a stronger penalty). Regularisation should help prevent overfitting, keeping the two curves closer to each other, and this might be happening once we pass the threshold of $10^{-4}$. With more granular data points we might see the curves stay consistently closer together.
Principal Component Analysis (PCA) is a method to reduce the dimensionality of a dataset while keeping accuracy/performance high. The idea is to drop the principal components that explain little of the variance in the data.
pca = PCA(n_components = None).fit(X_train_res)
We fit the PCA on the resampled training set.
components = len(X.columns)
# plot a scree plot
plt.plot(range(1,components+1), np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
plt.title("Half of our components explain the majority of the variance")
plt.grid()
plt.show()
The scree plot above shows the cumulative percentage of variance explained as principal components are added. For example, the first principal component explains close to three quarters of the variance. As we add more components, the marginal improvement in explained variance slowly decreases, and we reach roughly 100% of explained variance at around the 15th principal component (out of 32, since we have 32 features).
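To turn the scree plot into a concrete number of components, we can read off the smallest count that crosses a chosen variance threshold; a quick sketch using an illustrative 95% cut-off:
# cumulative variance explained by the PCA fitted above
cum_var = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components reaching an illustrative 95% threshold
n_components_95 = int(np.argmax(cum_var >= 0.95)) + 1
print(f"{n_components_95} components explain {cum_var[n_components_95 - 1] * 100:.1f}% of the variance")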
We run a grid search with cross-validation through a pipeline to find the optimal hyperparameters for our logistic regression. We keep the default scoring method, accuracy, for the grid search: we tried other scoring methods and they did not change the selected model, and since most models were first run with the default, we stick with accuracy as the metric the grid search optimises.
c_range = [10**i for i in np.arange(-6, 6, 0.5)]
# note: C must be strictly positive for LogisticRegression, so 0 is not added to the grid

# setting up which scalers/models we want to grid search
# ('liblinear' supports both the l1 and l2 penalties; the default 'lbfgs' only supports l2)
estimator = [('scaling', StandardScaler()),
             ('reduce_dim', PCA()),
             ('LR', LogisticRegression(solver='liblinear', random_state=rand_seed))]

# defining parameters we want to compare
param = {'LR__penalty': ['l1', 'l2'],
         'LR__C': c_range,
         'reduce_dim__n_components': [0.5, 0.6, 0.7, 0.8, 0.9, None]}
# run cross validation
pipeline_cross_val_grid(estimator, param, X_train_res, y_train_res, X_test, y_test)
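pipeline_cross_val_grid is a project helper defined in ta_01_packages_functions.py and its implementation is not shown here; as an assumed sketch, it presumably boils down to a GridSearchCV over the whole pipeline along these lines (the cv value and reporting are assumptions):
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
def pipeline_cross_val_grid_sketch(estimator, param, X_tr, y_tr, X_te, y_te):
    # assumed shape of the project's helper: grid search over the pipeline steps
    pipe = Pipeline(estimator)
    grid = GridSearchCV(pipe, param_grid=param, cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_tr, y_tr)
    print(f"Best parameters: {grid.best_params_}")
    print(f"Best CV accuracy: {grid.best_score_:.3f}")
    print(f"Test accuracy: {grid.score(X_te, y_te):.3f}")
    return grid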
The best model here has a much lower accuracy score, but hopefully it scores better on precision and recall for non-functioning water points. Interestingly, the best model does not use PCA to reduce dimensions: regularising with L2 (Ridge) appears to be a better alternative to PCA dimensionality reduction. Rather than removing features outright, L2 shrinks the coefficients of less informative features towards (but not exactly to) 0, limiting their influence; it is L1 (Lasso) that sets coefficients exactly to 0.
X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)
Note that the scaler should be fitted on the training set only and then used to transform both the training and test sets.
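The scaling helper also comes from ta_01_packages_functions.py; a minimal version consistent with the note above (fit on train only, transform both sets) might look like this (assumed behaviour, not the actual implementation):
def scaling_sketch(scaler, X_tr, X_te):
    # assumed behaviour of the project's scaling helper
    scaler.fit(X_tr)                  # learn means/standard deviations on the train set only
    return scaler.transform(X_tr), scaler.transform(X_te)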
start=time.time()
#instantiate and fit
LR_opt = LogisticRegression(penalty='l2', C=0.01, random_state=rand_seed).fit(X_train_res_scaled, y_train_res)
end=time.time()
time_fit_opt=end-start
print(f"Time to fit the model on the training set is {round(time_fit_opt,3)} seconds")
The time to fit the model with our optimal parameters is similar to the baseline model, around 0.4 seconds.
fpr_train_opt, tpr_train_opt, roc_auc_train_opt, precision_train_opt_plot, recall_train_opt_plot, pr_auc_train_opt, time_predict_train_opt = print_report(LR_opt, X_train_res_scaled, y_train_res)
#storing accuracy scores
accuracy_train_opt, precision_train_opt, recall_train_opt, f1_train_opt = get_scores(LR_opt, X_train_res_scaled, y_train_res)
The accuracy score on the (resampled) training set has dropped from 80% to 65%. However, the precision, recall and f1 scores have all improved dramatically, all of them sitting around the 65% mark.
fpr_test_opt, tpr_test_opt, roc_auc_test_opt, precision_test_opt_plot, recall_test_opt_plot, pr_auc_test_opt, time_predict_test_opt = print_report(LR_opt, X_test_scaled, y_test)
print(f"Time to predict the outcome variable for the test set is {round(time_predict_test_opt,3)} seconds")
#storing accuracy scores
accuracy_test_opt, precision_test_opt, recall_test_opt, f1_test_opt = get_scores(LR_opt, X_test_scaled, y_test)
The various accuracy metrics for functioning water points in our optimised model have decreased compared to our baseline model. However, the metrics for non-functioning water points have increased substantially: for example, only 9% of predictions are now false positives, compared to 20% previously. Our false negatives have risen to nearly a third of predictions, which is not great. The time to predict the test set has slightly improved on the base model, even though we specified more parameters.
plot_curve_roc('LR', fpr_train_base, tpr_train_base, roc_auc_train_base, fpr_train_opt, tpr_train_opt, roc_auc_train_opt,
               fpr_test_base, tpr_test_base, roc_auc_test_base, fpr_test_opt, tpr_test_opt, roc_auc_test_opt)
The ROC curve shows the performance of our models at different classification thresholds. Our optimised model performs much better than our baseline model on the training set: the resampling and parameter optimisation seem to have made the model better at identifying patterns in the data. The improvement is much smaller on the test set, although still present.
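For reference, the fpr/tpr/AUC values behind this plot are the standard scikit-learn quantities and can be recomputed directly from the fitted model; a quick sketch on the optimised model and the test set:
from sklearn.metrics import roc_curve, auc
# probability of the positive (functioning) class on the scaled test set
y_proba = LR_opt.predict_proba(X_test_scaled)[:, 1]
fpr_check, tpr_check, _ = roc_curve(y_test, y_proba)
print(f"Test ROC AUC for the optimised model: {auc(fpr_check, tpr_check):.3f}")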
plot_curve_prec_recall('LR', recall_train_base_plot, precision_train_base_plot, pr_auc_train_base, recall_train_opt_plot, precision_train_opt_plot, pr_auc_train_opt,
                       recall_test_base_plot, precision_test_base_plot, pr_auc_test_base, recall_test_opt_plot, precision_test_opt_plot, pr_auc_test_opt)
The precision/recall curve is not a very useful visualisation here, as we care mostly about the recall score and not so much about its relationship with precision. We will refrain from plotting these curves in the future.
coeff_bar_chart(LR_opt.coef_, X.columns, t=True)
The colors in the bar chart above represent the initial hypotheses we made. Positive (blue) means we expected a positive relationship between that variable and the functionality of a water point, grey is neutral (where we believed the relationship might be more complex), and red marks an expected negative relationship. This enables us to assess whether our hypotheses have been validated.
Water points installed after 2006 have a much higher probability of being functional. Not being in the Northern region and being in a region with a high bank account ownership rate also increase that probability. Being publicly managed and using more complex technology are also signs of a functioning water point.
Interestingly, violent events and conflicts are weak predictors of water point functionality. This might suggest that demographic and regional factors are the main drivers, and that violence is itself partly a result of these regional differences.
Health and prosperity metrics have little explanatory power for whether a water point functions.
Finally, water points that serve a large percentage of their local population, and those that are crucial to neighbouring communities, have a higher probability of not functioning. We can imagine that overuse might be a major cause of this.
Regarding our hypotheses, we were especially wrong about the proportion of people in temporary dwellings, which turns out to be associated with functioning water points. We also thought that higher rates of secondary school enrollment would be positively correlated with functionality; in this model, we find the opposite.
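As a cross-check on the bar chart, the strongest coefficients can also be listed directly; a quick sketch (the model was fitted on standardised features, so coefficient magnitudes are roughly comparable):
# coefficients of the optimised model, indexed by feature name
coefs = pd.Series(LR_opt.coef_[0], index=X.columns).sort_values()
print("Strongest predictors of non-functioning points:")
print(coefs.head(5))
print("Strongest predictors of functioning points:")
print(coefs.tail(5))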
Image(dictionary_filepath+"6-Hypotheses.png")
joblib.dump(LR_opt, model_filepath+'logistic_regression_model.sav')
d = {'Model': ['Logistic Regression'], 'Parameters': ['Penalty=l2, C=0.01, Standard Scaler'],
     'Accuracy Train': [accuracy_train_opt], 'Precision Train': [precision_train_opt],
     'Recall Train': [recall_train_opt], 'F1 Train': [f1_train_opt], 'ROC AUC Train': [roc_auc_train_opt],
     'Accuracy Test': [accuracy_test_opt], 'Precision Test': [precision_test_opt],
     'Recall Test': [recall_test_opt], 'F1 Test': [f1_test_opt], 'ROC AUC Test': [roc_auc_test_opt],
     'Time Fit': [time_fit_opt], 'Time Predict': [time_predict_test_opt],
     'Precision Non-functioning Test': [0.27], 'Recall Non-functioning Test': [0.55],
     'F1 Non-functioning Test': [0.36], 'Precision Functioning Test': [0.86],
     'Recall Functioning Test': [0.65], 'F1 Functioning Test': [0.74]}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'logistic_regression_model.csv')
metrics = [fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name = ['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']

# save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath + f'logistic_regression_{metric_name}', metric)
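For the later model comparison, the saved artefacts can be reloaded with joblib.load and np.load; a short sketch reusing the notebook's model_filepath (np.save adds the .npy extension automatically):
# reload the fitted model and the saved ROC arrays in a later comparison notebook
LR_loaded = joblib.load(model_filepath + 'logistic_regression_model.sav')
fpr_test_loaded = np.load(model_filepath + 'logistic_regression_fpr_test_opt.npy')
tpr_test_loaded = np.load(model_filepath + 'logistic_regression_tpr_test_opt.npy')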