1. Water point data
Accessing water point data from the Water Point Data Exchange (WPdx)
We will set out our business problem and the ML question we are trying to answer. Then we will access the water point data for Uganda and perform some basic cleaning steps to prepare it for merging and further analysis.
Uganda and much of sub-Saharan Africa lack access to basic water services. Not having access to water puts lives at risk through improper sanitation and dehydration. It also reinforces the cycle of poverty and heavily impacts a population that largely relies on subsistence farming. We want to enable the Ugandan government to repair and improve water points more quickly and efficiently. Our model would be used on a monthly or yearly basis to plan and carry out repairs and improvements so that the population benefits from as much water access as possible. Ideally, the on-the-ground actions that result from our predictions will be fed back into the model so it can improve its performance in the future. Ultimately, we want Ugandan citizens to be able to meet their basic needs, enabling their communities to develop and prosper.
Can we use machine learning to predict the functionality of water infrastructure so that it can be repaired more quickly?
Alternatively,
Using demographic, water infrastructure, and conflict data, can we develop a classification model that identifies non-functioning water points with a recall score greater than 40%?
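For reference, recall here means the share of truly non-functioning water points that the model actually flags. A minimal sketch of how we would compute it with scikit-learn (the labels below are illustrative, not real data; 1 = functioning, 0 = non-functioning, matching the status_id encoding used later in this notebook):

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 0, 1]  # illustrative ground-truth functionality labels
y_pred = [0, 1, 1, 1, 0, 1]  # illustrative model predictions

# recall for the non-functioning class (pos_label=0):
# of the 3 truly non-functioning points, 2 were identified -> 2/3
print(recall_score(y_true, y_pred, pos_label=0))
```

A model meeting our target would score above 0.40 on this metric for the non-functioning class.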
Water infrastructure and water access have been found to be contributing factors to conflict. However, the main finding from the research is that demographic (socio-economic) conditions are usually much more important in explaining conflict. This is why we will use both to try to predict water point functionality.
Relevant papers:
We will be accessing water point data from the source above. This notebook will do some initial cleaning and preparing for further analysis.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
%run /Users/thomasadler/Desktop/futuristic-platipus/keys.py
Image(images_filepath+"wpdx-api.png")
socrata_domain = 'data.waterpointdata.org'
socrata_dataset_identifier = 'eqje-vguj'
socrata_token = os.environ.get(water_api_key)
client = Socrata(socrata_domain, socrata_token, timeout=10)
The goal is to access every water point recorded in Uganda, along with all the information associated with each point. We set a very high limit to make sure we retrieve all possible records.
water_uganda_query = """
select
*
where
clean_country_name = 'Uganda'
limit
200000
"""
results = client.get(socrata_dataset_identifier, query=water_uganda_query)
water_df = pd.DataFrame.from_records(results)
uganda_water_df = water_df.copy()
uganda_water_df.tail()
uganda_water_df.info()
We see that all imported columns are of object type; we will fix that next.
We exclude columns which have already been cleaned by the WPdx team and stored in a separate "clean" column. We also exclude columns containing links and URLs.
uganda_water_df_clean = pd.DataFrame(uganda_water_df.drop(columns=['clean_country_id', 'activity_id', 'adm1', 'adm2', 'adm3', 'country_id',
'country_name', 'status', 'water_source', 'management',
'water_tech', 'orig_lnk', 'photo_lnk', 'data_lnk',
'public_data_source', 'converted', 'created_timestamp',
'days_since_report', 'updated']))
# check current columns
uganda_water_df_clean.columns
print('fecal_coliform_presence distribution:',
uganda_water_df_clean['fecal_coliform_presence'].value_counts())
print('is_latest distribution:',
uganda_water_df_clean['is_latest'].value_counts())
print('is_urban distribution:',
uganda_water_df_clean['is_urban'].value_counts())
print('status_id distribution:',
uganda_water_df_clean['status_id'].value_counts())
We convert yes/no and True/False columns into 1s and 0s for modelling. We assume that water points with an unknown status are not working.
# convert categorical flags into 1/0 indicator columns
uganda_water_df_clean['fecal_coliform_presence'] = uganda_water_df_clean[
    'fecal_coliform_presence'].map({'Present': 1, 'Absent': 0})
uganda_water_df_clean['is_latest'] = uganda_water_df_clean['is_latest'].astype('int64')
uganda_water_df_clean['is_urban'] = uganda_water_df_clean['is_urban'].map({True: 1, False: 0})
uganda_water_df_clean['status_id'] = uganda_water_df_clean['status_id'].map(
    {'Yes': 1, 'No': 0, 'Unknown': 0})
# check
uganda_water_df_clean.info()
CAREFUL: is_urban was inadvertently dropped here by mistake. If re-running the analysis, account for this error.
num_columns = ['distance_to_city', 'distance_to_primary', 'lat_deg', 'lat_deg_original',
'lon_deg', 'lon_deg_original', 'distance_to_secondary', 'distance_to_tertiary',
'distance_to_town', 'fecal_coliform_value', 'cluster_size', 'count', 'crucialness',
'install_year', 'local_population', 'pressure',
'rehab_priority', 'served_population', 'staleness_score', 'usage_cap']
for col in num_columns:
float_converter(uganda_water_df_clean, col)
# check
uganda_water_df_clean.info()
date_converter(uganda_water_df_clean, 'report_date')
uganda_water_df_clean['install_year']=pd.to_datetime(uganda_water_df_clean['install_year'], format="%Y")
# check
print(uganda_water_df_clean['report_date'].head())
print(uganda_water_df_clean['install_year'].head())
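float_converter and date_converter are defined in the shared packages script (ta_01_packages_functions.py), so their definitions do not appear in this notebook. A plausible minimal version, assuming they simply coerce a column in place, would look like this (the demo DataFrame is made up):

```python
import pandas as pd

def float_converter(df, col):
    # hypothetical sketch: coerce a column to numeric, turning bad values into NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')

def date_converter(df, col):
    # hypothetical sketch: coerce a column to datetime, turning bad values into NaT
    df[col] = pd.to_datetime(df[col], errors='coerce')

demo = pd.DataFrame({'distance_to_city': ['12.5', 'n/a'],
                     'report_date': ['2020-01-01', 'not a date']})
float_converter(demo, 'distance_to_city')
date_converter(demo, 'report_date')
print(demo.dtypes)
```

The errors='coerce' behaviour matters here: malformed entries become NaN/NaT rather than raising, which is why the imputation steps below still find missing values to fill.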
str_columns = list(uganda_water_df_clean.select_dtypes("object").columns)
# remove special characters from string columns
for col in str_columns:
    uganda_water_df_clean[col] = uganda_water_df_clean[col].str.replace(
        r'[^\w\s]', '', regex=True)
uganda_water_df_clean.isna().mean() * 100
We decide to drop columns which have more than 50% of their values missing, as we deem them not to be of good enough quality to analyse. Some of these columns pertain to the presence of fecal coliform, the technology of the water point, the price of the water, and other notes on the quality of the water available.
# keep only columns with at least 50% non-null values (thresh must be an int)
uganda_water_df_clean = uganda_water_df_clean.dropna(
    axis=1, thresh=int(0.5 * len(uganda_water_df_clean)))
# check
uganda_water_df_clean.isna().mean() * 100
The documentation informs us that this dataset is not supposed to have any duplicate columns or rows.
# uganda_water_df_clean.duplicated().sum()
# uganda_water_df_clean.T.duplicated().sum()
We impute the missing values in the remaining columns with the median for their administrative region. We first use the lowest level, clean_adm4; if missing values remain, we fall back to the next level, clean_adm3, and so on. We choose the median so that the imputation is not too sensitive to outliers.
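missing_to_regional_median also lives in the packages script, so its definition is not shown here. A minimal sketch of the behaviour described above (an assumption, since the actual implementation is elsewhere) is a groupwise-median fill; the demo data is made up:

```python
import pandas as pd

def missing_to_regional_median(df, col, region_col):
    # hypothetical sketch: fill NaNs in `col` with the median of that row's region
    df[col] = df[col].fillna(df.groupby(region_col)[col].transform('median'))

demo = pd.DataFrame({
    'clean_adm4': ['A', 'A', 'A', 'B', 'B'],
    'pressure':   [1.0, 3.0, None, 5.0, None],
})
missing_to_regional_median(demo, 'pressure', 'clean_adm4')
print(demo['pressure'].tolist())  # [1.0, 3.0, 2.0, 5.0, 5.0]
```

Regions whose values are entirely missing produce no median, which is why the cascade through clean_adm3, clean_adm2, and clean_adm1 below is needed.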
uganda_water_df_clean_numerical = uganda_water_df_clean.select_dtypes(exclude='object')
# check
uganda_water_df_clean_numerical.head()
null_cols = uganda_water_df_clean_numerical.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index
cols_to_impute = cols_to_impute.drop('install_year')
# check
cols_to_impute
for col in cols_to_impute:
    plt.figure()
    # distplot is deprecated in recent seaborn; histplot with a KDE is the equivalent
    sns.histplot(uganda_water_df_clean[col], kde=True, stat='density')
    plt.axvline(uganda_water_df_clean[col].mean(), c='r', label='mean')
    plt.axvline(uganda_water_df_clean[col].median(), c='gold', label='median')
    plt.legend()
    plt.title(f'Density plot of {col}')
    plt.show()
We can confirm that the median is the better choice as it represents our data better.
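As a quick illustration of why the median is the more robust choice for skewed columns like these (the numbers below are made up):

```python
import statistics

# a skewed sample: one extreme value drags the mean far from the typical point
values = [120, 150, 180, 200, 25000]
print(statistics.mean(values))    # 5130 -> distorted by the outlier
print(statistics.median(values))  # 180  -> representative of typical values
```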
for col in cols_to_impute:
missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm4')
# check
uganda_water_df_clean.isna().sum()
for col in cols_to_impute:
missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm3')
# check
uganda_water_df_clean.isna().sum()
for col in cols_to_impute:
missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm2')
# check
uganda_water_df_clean.isna().sum()
for col in cols_to_impute:
missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm1')
# check
uganda_water_df_clean.isna().sum()
for col in cols_to_impute:
uganda_water_df_clean[col] = uganda_water_df_clean[col].fillna(
uganda_water_df_clean[col].median())
# check remaining null values
uganda_water_df_clean.isna().sum()
We will not fill the non-numeric columns that still have missing values, as the proportion of missing values is too great (15-25%). We will decide what to do with them later in our analysis. We must proceed with great care, as omitting or filling these columns/rows will have a large impact on the dataset.
uganda_water_df_clean.to_csv(data_filepath + 'ta_1_uganda_water_df_clean.csv')
Image(dictionary_filepath+"1A-Water-Dictionary.png")
# #constructing columns of interest
# median_columns_adm1=list(cols_to_impute )
# median_columns_adm1.append('clean_adm1')
# #group by adm1, with median
# median_adm1=uganda_water_df_clean[median_columns_adm1].groupby('clean_adm1').median()
# median_adm1.reset_index(inplace=True)
# #merge median dataset with original dataset
# uganda_water_df_clean=pd.merge(uganda_water_df_clean, median_adm1, how='left',\
# left_on=uganda_water_df_clean['clean_adm1'],\
# right_on=median_adm1['clean_adm1'], suffixes=(None, '_median_adm1'))
# #fill null values with the median value of its region (adm1)
# for col in cols_to_impute:
# uganda_water_df_clean[col] = uganda_water_df_clean[col].fillna(uganda_water_df_clean\
# [f'{col}_median_adm1'])
# #removing key column resulting from the merge
# del uganda_water_df_clean[uganda_water_df_clean.columns[0]]
# uganda_water_df_clean.head()
# #check remaining null values
# uganda_water_df_clean.isna().sum()