14. Neural Network
- Neural Network
- Preparing data
- 1. Baseline model: choosing loss function and metric
- 2. Choosing layers
- 3. Choosing optimiser
- 4. Choosing regularization
- 5. Choosing dropout rate
- 6. Choosing batch normalization
- 7. Choosing activation function
- 8. Trying more epochs
- Optimal Model
- Analysis
- Comparing results
- Exporting
We run our ninth and final ML model, a neural network. We test different architectures, varying layers, loss functions, metrics, optimisers, regularization, dropout rates, batch normalization, activation functions and epochs. We also visualise various accuracy scores, the confusion matrix and the ROC curve. We end by dumping our best model for further comparison.
We will be running a neural network with various architectures to attempt to optimise its performance.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
modelling_df=pd.read_csv(data_filepath + 'master_modelling_df.csv', index_col=0)
#check
modelling_df.info()
Image(dictionary_filepath+"5-Modelling-Data-Dictionary.png")
X =modelling_df.loc[:, modelling_df.columns != 'is_functioning']
y = modelling_df['is_functioning']
#check
print(X.shape)
print(y.shape)
Our independent variables (X) should have the same number of rows (107,184) as our dependent variable (y). y should only have one column, as it is the outcome variable.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=rand_seed)
sm = SMOTE(random_state=rand_seed)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
#compare resampled dataset
print(f"Test set has {round(y_test.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_test.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Original train set has {round(y_train.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train.value_counts(normalize=True)[1]*100,1)}% functioning")
print(f"Resampled train set has {round(y_train_res.value_counts(normalize=True)[0]*100,1)}% non-functioning water points and {round(y_train_res.value_counts(normalize=True)[1]*100,1)}% functioning")
We over-sample the minority class, non-functioning water points, to get an equal distribution of our outcome variable. Note that resampling is applied to the train set only; we should not tinker with the test set, which must reflect the real class distribution.
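For intuition, SMOTE creates each synthetic minority sample by interpolating between a real minority point and one of its nearest minority-class neighbours. A minimal sketch of the idea (not imblearn's implementation), with made-up points:
import numpy as np
rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])    #a minority-class point (hypothetical)
x_nn = np.array([1.5, 2.5])   #one of its minority-class neighbours (hypothetical)
lam = rng.uniform()           #random interpolation fraction in [0, 1]
x_new = x_i + lam * (x_nn - x_i)
print(x_new)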
X_train_res_scaled, X_test_scaled = scaling(StandardScaler(), X_train_res, X_test)
We also scale the data: standardised features help a neural network's gradient-based optimisation converge and should improve its accuracy.
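The scaling() helper lives in our packages file; what we assume it does is fit the scaler on the resampled train set only and apply that fit to both sets, so no test-set information leaks into the transform. A hypothetical reconstruction:
#assumed behaviour of the scaling() helper (hypothetical reconstruction)
def scaling(scaler, X_train, X_test):
    X_train_scaled = scaler.fit_transform(X_train)  #fit on train only
    X_test_scaled = scaler.transform(X_test)        #reuse the train fit
    return X_train_scaled, X_test_scaled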
We will test a range of neural network configurations and infer which respond best, keeping whichever provides the best accuracy score on the test set. Neural networks are good at identifying non-linear and complex relationships between features; let's see if they add anything to our problem.
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We try a first model with baseline, common parameters: the Adam optimiser, binary cross-entropy as the loss function and binary accuracy as our metric. The model has no hidden layer, just one input and one output layer. One thing we can see is that the difference between the train and test accuracy is very low, suggesting little overfitting.
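As a sanity check on model size: a Dense layer has one weight per input per unit plus one bias per unit, so with d input features the first layer holds 32*(d+1) parameters and the output layer 33. Keras can confirm this once the model has been built by the first fit:
#optional check: layer output shapes and parameter counts
NN.summary()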
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We add a hidden layer of 16 nodes and the accuracy score on the test set improves by around 3 percentage points.
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layers
NN.add(layers.Dense(16, activation="relu"))
NN.add(layers.Dense(8, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We add a second hidden layer of 8 nodes, but the accuracy score does not improve.
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(8, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We try a single hidden layer of 8 nodes instead of 16; the accuracy score is no better.
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.SGD(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
Using a different optimiser, SGD, does not improve our model; we will stick with the Adam optimiser.
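We used SGD with its default settings; tuning the learning rate or adding momentum (hypothetical values below) might narrow the gap with Adam, but we do not pursue it here:
#hypothetical variant: SGD with a tuned learning rate and momentum
NN.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])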
#sequential model
NN = keras.Sequential()
#set regularization
regularizer = keras.regularizers.l2(0.01)
#input layer
NN.add(layers.Dense(32, activation="relu", kernel_regularizer=regularizer))
#hidden layer
NN.add(layers.Dense(16, activation="relu", kernel_regularizer=regularizer))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN = keras.Sequential()
#set regularization
regularizer = keras.regularizers.l1(0.01)
#input layer
NN.add(layers.Dense(32, activation="relu", kernel_regularizer=regularizer))
#hidden layer
NN.add(layers.Dense(16, activation="relu", kernel_regularizer=regularizer))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
Adding either kind of regulariser (l1/Lasso or l2/Ridge) hurts our neural network heavily. Regularization attempts to prevent overfitting, but our baseline showed little overfitting to begin with, so here the weight penalty mostly just constrains the fit.
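For reference, kernel_regularizer=l2(0.01) augments the training loss with a weight penalty, so the network minimises binary cross-entropy plus 0.01 times the sum of squared kernel weights; l1 penalises absolute values instead. A conceptual sketch of the two penalties with hypothetical weights, not Keras's internals:
import numpy as np
w = np.array([0.5, -1.2, 0.3])          #hypothetical kernel weights
l2_penalty = 0.01 * np.sum(w**2)        #Ridge: shrinks weights smoothly
l1_penalty = 0.01 * np.sum(np.abs(w))   #Lasso: pushes weights to exactly zero
print(l2_penalty, l1_penalty)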
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
NN.add(layers.Dropout(0.2))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
Adding a dropout rate of 20% after our hidden layer means that, at each training step, each of that layer's nodes is dropped with 20% probability, so on average 80% of nodes are active. Using this technique does improve our accuracy scores.
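To see the mechanics, Keras's Dropout zeroes each input with the given probability during training, rescales the survivors by 1/(1-rate), and acts as the identity at inference. A small demonstration, assuming a TensorFlow backend:
import numpy as np
from tensorflow.keras import layers
drop = layers.Dropout(0.2)
x = np.ones((1, 10))
print(drop(x, training=True))   #~20% of entries zeroed, the rest scaled to 1.25
print(drop(x, training=False))  #identity at inference time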
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
NN.add(layers.BatchNormalization())
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
Applying batch normalization to our hidden layer does not improve our scores either.
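For reference, batch normalization standardises each unit's activations over the mini-batch and then applies a learned per-unit scale (gamma) and shift (beta); the core computation, sketched in numpy:
import numpy as np
x = np.array([1.0, 2.0, 3.0])   #one unit's activations over a batch of 3
eps = 1e-3
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
#Keras then applies learned parameters: output = gamma * x_hat + beta
print(x_hat)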
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="sigmoid"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="sigmoid"))
#hidden layer
NN.add(layers.Dense(16, activation="sigmoid"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="relu"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="relu"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="sigmoid"))
#hidden layer
NN.add(layers.Dense(16, activation="relu"))
#output layer
NN.add(layers.Dense(1, activation="relu"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="sigmoid"))
#hidden layer
NN.add(layers.Dense(16, activation="sigmoid"))
#output layer
NN.add(layers.Dense(1, activation="relu"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We test different combinations of activation functions for our input, hidden and output layers. The best combination uses a sigmoid function for all three. That model's accuracy score is not terrific, but it is better than anything else we have obtained so far.
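For reference, the two activations under comparison differ in range: relu(x) = max(0, x) is unbounded above, while sigmoid(x) = 1/(1 + e^(-x)) squashes values into (0, 1). In numpy:
import numpy as np
def relu(x): return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))
x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x))
Note that a relu output layer can emit values outside (0, 1), which binary cross-entropy is not designed for; this likely explains why the sigmoid output performs best.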
#sequential model
NN = keras.Sequential()
#input layer
NN.add(layers.Dense(32, activation="sigmoid"))
#hidden layer
NN.add(layers.Dense(16, activation="sigmoid"))
#output layer
NN.add(layers.Dense(1, activation="sigmoid"))
#compile
NN.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#fit on training set
results = NN.fit(X_train_res_scaled, y_train_res, epochs=200, verbose=0)
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
#sequential model
NN_opt = keras.Sequential()
#input layer
NN_opt.add(layers.Dense(32, activation="sigmoid"))
#hidden layer
NN_opt.add(layers.Dense(16, activation="sigmoid"))
#output layer
NN_opt.add(layers.Dense(1, activation="sigmoid"))
#compile
NN_opt.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(),
metrics=[keras.metrics.BinaryAccuracy()])
#time process
start=time.time()
#fit on training set, validating on the scaled test set (the model expects scaled inputs)
results = NN_opt.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0, validation_data=(X_test_scaled, y_test))
end=time.time()
time_fit_opt=end-start
#get scores from neural network
train_accuracy = results.history["binary_accuracy"][-1]
result = NN_opt.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Time to fit the model on the training set is {round(time_fit_opt,3)} seconds")
print(f"Accuracy score on Train set: {train_accuracy}")
print(f"Accuracy score on Test set: {result[1]}")
We re-run our optimal model; it does not achieve exactly the same accuracy score as before because the network starts from random weights every time.
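This run-to-run variation could be removed by seeding the framework's random number generators before building the model; assuming TensorFlow's Keras, something like:
import tensorflow as tf
tf.random.set_seed(rand_seed)  #rand_seed is defined in our packages file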
plt.figure()
plt.plot(results.epoch, results.history['binary_accuracy'])
plt.plot(results.epoch, results.history['val_binary_accuracy'])
plt.title('Binary Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'])
plt.grid()
plt.show()
The highest validation accuracy is achieved at around the 12th epoch. It seems our dataset and its relationships are not complex enough to need that many epochs to achieve a high accuracy score.
plt.figure()
plt.plot(results.epoch, results.history['loss'])
plt.plot(results.epoch, results.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'])
plt.grid()
plt.show()
Similarly, the loss is minimised at around the 12th epoch.
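Since both curves flatten after roughly a dozen epochs, an early-stopping callback could cut training time; a sketch using Keras's built-in callback:
#stop once validation loss stops improving, keeping the best weights (sketch)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
results = NN_opt.fit(X_train_res_scaled, y_train_res, epochs=50, verbose=0,
validation_data=(X_test_scaled, y_test), callbacks=[early_stop])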
predictions_train_proba=NN_opt.predict(X_train_res_scaled)
#convert to class
predictions_train=np.where(predictions_train_proba>0.5, 1, 0)
fpr_train_opt, tpr_train_opt, thresholds_roc_train_opt = roc_curve(y_train_res, predictions_train_proba)
#getting precision/recall scores
precision_train_opt_plot, recall_train_opt_plot, thresholds_pr_train_opt = precision_recall_curve(y_train_res, predictions_train_proba)
# storing values
roc_auc_train_opt = auc(fpr_train_opt, tpr_train_opt)
pr_auc_train_opt = auc(recall_train_opt_plot, precision_train_opt_plot)
# seeing model results
print(f'ROC AUC: {roc_auc_train_opt}')
print(f'PR AUC: {pr_auc_train_opt}')
print(classification_report(y_train_res, predictions_train))
#print confusion matrix
cf_matrix=confusion_matrix(y_train_res, predictions_train)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Greens')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(['Not functioning','Functioning'])
ax.yaxis.set_ticklabels(['Not functioning','Functioning'])
plt.show()
All of our accuracy metrics for our train set are just under the 80% mark.
start=time.time()
# prediction of our model on test set
predictions_test_proba=NN_opt.predict(X_test_scaled)
#convert to class
predictions_test=np.where(predictions_test_proba>0.5, 1, 0)
end=time.time()
time_predict_opt=end-start
print(f"Time to predict the model on the test set is {round(time_predict_opt,3)} seconds")
We can see that the time it takes to predict is relatively long compared to other models. We also need an additional step: Keras's predict returns probabilities rather than classes, so we threshold at 0.5 to obtain class labels ourselves.
fpr_test_opt, tpr_test_opt, thresholds_roc_test_opt = roc_curve(y_test, predictions_test_proba)
#getting precision/recall scores
precision_test_opt_plot, recall_test_opt_plot, thresholds_pr_test_opt = precision_recall_curve(y_test, predictions_test_proba)
# storing values
roc_auc_test_opt = auc(fpr_test_opt, tpr_test_opt)
pr_auc_test_opt = auc(recall_test_opt_plot, precision_test_opt_plot)
# seeing model results
print(f'ROC AUC: {roc_auc_test_opt}')
print(f'PR AUC: {pr_auc_test_opt}')
print(classification_report(y_test, predictions_test))
#print confusion matrix
cf_matrix=confusion_matrix(y_test, predictions_test)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Greens')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(['Not functioning','Functioning'])
ax.yaxis.set_ticklabels(['Not functioning','Functioning'])
plt.show()
The model performs relatively well on recall for non-functioning water points: it is not "missing" as many of them as other models do. As expected, the model performs very well for functioning water points, with 90% of its functioning labels correct.
plt.figure(figsize=(10,15))
plt.plot([0,1], [0,1], color='black', linestyle='--')
plt.title('Receiver Operating Characteristic (ROC) Curve - NN')
plt.plot(fpr_train_opt, tpr_train_opt, color='blueviolet', lw=2,
label='Train AUC = %0.2f' % roc_auc_train_opt)
plt.plot(fpr_test_opt, tpr_test_opt, color='crimson', lw=2,
label='Test AUC = %0.2f' % roc_auc_test_opt)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="best")
plt.tight_layout()
plt.grid()
plt.show()
As expected, the test set has a smaller AUC than the train set, since it is unseen data.
Similarly to our KNN model, we do not visualise the feature importance of our optimal neural network. The reason is that we would need SHAP, which is computationally extremely expensive (one run took nearly 10 minutes, and 50-100 runs are recommended). We will consider using SHAP later on, when comparing models, if the neural network or KNN model turns out to be the best performer.
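If we do revisit this, running SHAP's KernelExplainer on small subsamples keeps the cost manageable; a sketch, assuming the shap package and hypothetical sample sizes of 100:
import shap
#small background and evaluation samples keep KernelExplainer tractable (sketch)
background = shap.sample(X_train_res_scaled, 100)
explainer = shap.KernelExplainer(NN_opt.predict, background)
shap_values = explainer.shap_values(shap.sample(X_test_scaled, 100))
shap.summary_plot(shap_values, feature_names=X.columns)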
Image(dictionary_filepath+"6-Hypotheses.png")
joblib.dump(NN_opt, model_filepath+'neural_network_model.sav')
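Note that joblib serialises via pickle, which does not always handle Keras models cleanly; Keras's native format is a safer alternative if loading the dump ever fails:
#alternative: Keras's native model format (sketch)
NN_opt.save(model_filepath + 'neural_network_model.h5')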
d = {'Model':['Neural Network'], 'Parameters':['Hidden layer=16 nodes, Activation=Sigmoid (all layers), Optimizer=Adam, Loss function=BinaryCrossentropy, Metric=BinaryAccuracy'],\
'Accuracy Train': None,\
'Precision Train': None, 'Recall Train': None, 'F1 Train': None, 'ROC AUC Train':[roc_auc_train_opt],\
'Accuracy Test': None, 'Precision Test': None, 'Recall Test': None, 'F1 Test': None,\
'ROC AUC Test':[roc_auc_test_opt], 'Time Fit': time_fit_opt,\
'Time Predict': time_predict_opt, "Precision Non-functioning Test":0.42, "Recall Non-functioning Test":0.63,\
"F1 Non-functioning Test":0.50, "Precision Functioning Test":0.90, "Recall Functioning Test":0.78,"Functioning Test":0.84}
#to dataframe
best_model_result_df=pd.DataFrame(data=d)
#check
best_model_result_df
best_model_result_df.to_csv(model_filepath + 'neural_network_model.csv')
metrics=[fpr_train_opt, tpr_train_opt, fpr_test_opt, tpr_test_opt]
metrics_name=['fpr_train_opt', 'tpr_train_opt', 'fpr_test_opt', 'tpr_test_opt']
#save numpy arrays for model comparison
for metric, metric_name in zip(metrics, metrics_name):
    np.save(model_filepath+f'neural_network_{metric_name}', metric)