15. WPDx Model
Extracting predictions from the WPDx model for comparison with our own
We extract the predictions of the WPDx model, visualise its performance, and save the results for comparison with our own models.
The Water Point Data Exchange (WPDx) has built its own predictor of water point functionality, trained on a much more limited dataset with basic ML models. We extract its key accuracy metrics to compare against our other models.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
%run /Users/thomasadler/Desktop/futuristic-platipus/keys.py
socrata_domain = 'data.waterpointdata.org'
socrata_dataset_identifier = '9pn9-g5u4'
socrata_token = os.environ.get(water_api_key)
client = Socrata(socrata_domain, socrata_token, timeout=10)
water_uganda_query = """
select *
limit 200000
"""
results = client.get(socrata_dataset_identifier, query=water_uganda_query)
wpdx_prediction_df_raw = pd.DataFrame.from_records(results)
wpdx_prediction_df = wpdx_prediction_df_raw.copy()
wpdx_prediction_df.head()
wpdx_prediction_df.info()
# keep only the actual status and the model's predicted probability
wpdx_prediction_df = wpdx_prediction_df[['status_id', 'prediction']]
#check
wpdx_prediction_df.info()
wpdx_prediction_df.isna().sum().sum()
for col in ['prediction']:
    float_converter(wpdx_prediction_df, col)
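`float_converter` is defined in the shared ta_01_packages_functions.py script, which is not shown here. A minimal sketch of what such a helper might look like (the name and signature match the call above, but the body is an assumption):

```python
import pandas as pd

def float_converter(df, col):
    """Convert a column to float in place, coercing unparseable values to NaN.

    Hypothetical re-implementation of the notebook's shared helper.
    """
    df[col] = pd.to_numeric(df[col], errors="coerce").astype(float)

# usage on a small frame with one bad value
demo = pd.DataFrame({"prediction": ["0.72", "0.15", "bad"]})
float_converter(demo, "prediction")
```

Socrata returns all fields as strings, so some conversion of this kind is needed before thresholding the probabilities.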
# check
wpdx_prediction_df.info()
# encode status as binary: 1 = functioning, 0 = not functioning
wpdx_prediction_df['status_id'] = np.where(wpdx_prediction_df['status_id'] == True, 1, 0)
# check
wpdx_prediction_df.info()
# binarise predicted probabilities at the default 0.5 threshold
wpdx_prediction_df['y_prediction_wpdx'] = np.where(wpdx_prediction_df['prediction'] > 0.5, 1, 0)
wpdx_prediction_df['y_prediction_wpdx'].value_counts(normalize=True)
wpdx_prediction_df.rename(columns={'status_id': 'y_real_wpdx', 'prediction': 'y_proba_wpdx'}, inplace=True)
#check
wpdx_prediction_df.info()
Image(images_filepath+"WPDx_Methodology.png")
fpr_wpdx, tpr_wpdx, thresholds_roc_wpdx = roc_curve(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_proba_wpdx'])
# getting precision/recall curve
precision_wpdx, recall_wpdx, thresholds_pr_wpdx = precision_recall_curve(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_proba_wpdx'])
# storing AUC values
roc_auc_wpdx = auc(fpr_wpdx, tpr_wpdx)
pr_auc_wpdx = auc(recall_wpdx, precision_wpdx)
# seeing model results
print(f'ROC AUC: {roc_auc_wpdx}')
print(f'PR AUC: {pr_auc_wpdx}')
print(classification_report(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_prediction_wpdx']))
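As a sanity check on how these curve-based metrics behave, here is the same sklearn machinery applied to a tiny toy example (toy labels and scores, not the WPDx data):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# four toy points: two negatives, two positives, with predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC: area under the false-positive-rate / true-positive-rate curve
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # → 0.75 for this toy example

# PR AUC: area under the recall / precision curve, as computed above
prec, rec, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(rec, prec)
```

Note that `auc` is a generic trapezoidal integrator, which is why it can be reused for both curves as in the cell above.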
Their documentation states that they share our goal: not missing non-functioning water points. They optimised their model for a high recall score, and end up with a recall of 0.40 for non-functioning water points.
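Optimising recall for the non-functioning class amounts to choosing a probability threshold. A sketch of a threshold sweep on toy scores (illustrative data, not the WPDx predictions) shows how raising the threshold lifts class-0 recall at the cost of more functioning points being flagged:

```python
import numpy as np

# toy ground truth (1 = functioning) and predicted probabilities of functioning
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_proba = np.array([0.2, 0.45, 0.6, 0.55, 0.7, 0.8, 0.9, 0.95])

recall_nonfunc = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_proba > threshold).astype(int)
    # class-0 recall: share of truly non-functioning points predicted as 0
    mask = y_true == 0
    recall_nonfunc[threshold] = (y_pred[mask] == 0).mean()
```

Here class-0 recall climbs from 1/3 at a 0.3 threshold to 1.0 at 0.7, which is the trade-off WPDx's 0.40 recall reflects.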
cf_matrix = confusion_matrix(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_prediction_wpdx'])

# build annotation labels combining name, count and percentage
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)

# plot annotated confusion matrix
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Greens')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')
ax.xaxis.set_ticklabels(['Not functioning', 'Functioning'])
ax.yaxis.set_ticklabels(['Not functioning', 'Functioning'])
plt.show()
Similar to our models, the precision and recall scores for functioning water points are very high. This is because the sample is imbalanced towards functioning water points (around 85% are functioning), so the majority class is easy to score well on.
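To see why the imbalance flatters the functioning-class metrics, consider a baseline that always predicts "functioning" on an 85/15 split (illustrative counts, not the actual dataset):

```python
# with an ~85/15 functioning/non-functioning split, a classifier that always
# predicts "functioning" already scores well on the majority class
n_functioning, n_broken = 850, 150

accuracy = n_functioning / (n_functioning + n_broken)  # 0.85 without learning anything
recall_functioning = 1.0  # every functioning point is flagged as functioning
recall_broken = 0.0       # every non-functioning point is missed
```

This is why the non-functioning recall of 0.40, not overall accuracy, is the figure to compare against our models.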
# note: precision_wpdx and recall_wpdx hold the full curves, so we compute single
# test-set scores from the hard predictions instead (precision_score and
# recall_score assumed imported via the packages script, as with classification_report)
precision_test_wpdx = precision_score(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_prediction_wpdx'])
recall_test_wpdx = recall_score(wpdx_prediction_df['y_real_wpdx'], wpdx_prediction_df['y_prediction_wpdx'])

d = {'Model': ['WPDx'], 'Parameters': ['Unknown'], 'Accuracy Train': [None],
     'Precision Train': [None], 'Recall Train': [None], 'F1 Train': [None], 'ROC AUC Train': [None],
     'Accuracy Test': [None], 'Precision Test': [precision_test_wpdx], 'Recall Test': [recall_test_wpdx],
     'F1 Test': [None], 'ROC AUC Test': [roc_auc_wpdx], 'Time Fit': [None], 'Time Predict': [None],
     'Precision Non-functioning Test': [0.54], 'Recall Non-functioning Test': [0.40],
     'F1 Non-functioning Test': [0.46], 'Precision Functioning Test': [0.87],
     'Recall Functioning Test': [0.92], 'F1 Functioning Test': [0.89]}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'wpdx_model.csv')
np.save(model_filepath + 'wpdx_fpr_wpdx', fpr_wpdx)
np.save(model_filepath + 'wpdx_tpr_wpdx', tpr_wpdx)
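These .npy files can later be reloaded with np.load when plotting ROC curves across models. A self-contained round-trip sketch, using a temporary directory in place of model_filepath:

```python
import os
import tempfile
import numpy as np

fpr = np.array([0.0, 0.2, 1.0])
tpr = np.array([0.0, 0.8, 1.0])

with tempfile.TemporaryDirectory() as tmp:
    # np.save appends '.npy' to the filename automatically
    np.save(os.path.join(tmp, 'wpdx_fpr_wpdx'), fpr)
    np.save(os.path.join(tmp, 'wpdx_tpr_wpdx'), tpr)
    fpr_loaded = np.load(os.path.join(tmp, 'wpdx_fpr_wpdx.npy'))
    tpr_loaded = np.load(os.path.join(tmp, 'wpdx_tpr_wpdx.npy'))
```

Note the '.npy' suffix must be supplied explicitly when loading, since np.save adds it silently on write.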