We will set out our business problem and subsequent ML question that we are trying to answer. Then we will access the water point data for Uganda and perform some basic cleaning steps to prepare it for further merging and analysis.

Business Problem

Uganda and sub-saharan Africa lack access to basic water services. Not having access to water puts lives at risk through improper sanitation, dehydration. It also reinforces the cycle of poverty and heavily impacts a population that heavily relies on subsistence farming. We want to enable the Ugandan government to repair and improve water points quicker and in a more efficient manner. Our model would be used on a monthly/yearly basis to plan and carry out repairs and improvements so that the population will benefit form as much water access as possible. Ideally, the on the ground actions that result from our predictions will get fed back into the model, so it can improve it's performance in the future. Ultimately we want Ugandan citizens to meet their basic needs and enable their communities to develop and prosper.

Machine Learning Problem

Can we predict the functionality of water infrastructure to repair them quicker using Machine Learning?

Alternatively,

Using demographic, water infrastructure, and conflict data, can we develop a classification model that is able to identify non-functioning water points with a recall score greater than 40%?"

Water and Conflict

Water infrastructure and water access has been found to be a factor contributing to conflicts. However, the main findings form research is that demographic (socio-economic) situations are usually much more important in explaining conflict. This is why we will be using both to try to predict water functionality.

Relevant papers:

We will be accessing water point data from the source above. This notebook will do some initial cleaning and preparing for further analysis.

%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
/Users/thomasadler/opt/anaconda3/lib/python3.9/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
%run /Users/thomasadler/Desktop/futuristic-platipus/keys.py
Image(images_filepath+"wpdx-api.png")
socrata_domain = 'data.waterpointdata.org'
socrata_dataset_identifier = 'eqje-vguj'
socrata_token = os.environ.get(water_api_key)
client = Socrata(socrata_domain, socrata_token, timeout=10)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.

The goal is to access every water point recorded in Uganda, all the information associated with that water point and put a very high limit, so we make sure we have accessed all possible points.

water_uganda_query = """
select
    *
where
    clean_country_name = 'Uganda'
limit
    200000
"""
results = client.get(socrata_dataset_identifier, query=water_uganda_query)
water_df = pd.DataFrame.from_records(results)
uganda_water_df = water_df.copy()
uganda_water_df.tail()
row_id source lat_deg lon_deg report_date status_id water_tech_clean _water_tech_category facility_type clean_country_name ... subjective_quality scheme_id notes photo_lnk lat_deg_original lon_deg_original fecal_coliform_presence installer orig_lnk fecal_coliform_value
108901 374602 Ugandan Water Project 0.6758382999999955 32.4640283 2022-09-04T00:00:00.000 Yes Hand Pump - India Mark II Hand Pump Improved Uganda ... NaN NaN NaN NaN NaN NaN NaN Ugandan Water Project (rehabilitation) NaN NaN
108902 374581 Ugandan Water Project 0.6553664659999812 32.472553197 2022-11-05T00:00:00.000 Yes NaN NaN Improved Uganda ... NaN NaN NaN NaN NaN NaN NaN Ugandan Water Project NaN NaN
108903 374572 Ugandan Water Project 0.692485682666941 32.66798296946349 2022-06-30T00:00:00.000 Yes NaN NaN Improved Uganda ... NaN NaN NaN NaN NaN NaN NaN Ugandan Water Project NaN NaN
108904 374644 Ugandan Water Project 0.18494999999998163 32.3369641 2022-09-06T00:00:00.000 Yes Hand Pump - India Mark II Hand Pump Improved Uganda ... NaN NaN NaN NaN NaN NaN NaN Ugandan Water Project (rehabilitation) NaN NaN
108905 374660 Ugandan Water Project 0.5684852999999981 32.80591690000001 2022-09-05T00:00:00.000 Yes Hand Pump - India Mark II Hand Pump Improved Uganda ... NaN NaN NaN NaN NaN NaN NaN Ugandan Water Project (rehabilitation) NaN NaN

5 rows × 67 columns

uganda_water_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108906 entries, 0 to 108905
Data columns (total 67 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   row_id                     108906 non-null  object
 1   source                     108906 non-null  object
 2   lat_deg                    108906 non-null  object
 3   lon_deg                    108906 non-null  object
 4   report_date                108906 non-null  object
 5   status_id                  108906 non-null  object
 6   water_tech_clean           26226 non-null   object
 7   _water_tech_category       26226 non-null   object
 8   facility_type              108906 non-null  object
 9   clean_country_name         108906 non-null  object
 10  clean_adm1                 108906 non-null  object
 11  clean_adm2                 108906 non-null  object
 12  clean_adm3                 108906 non-null  object
 13  clean_adm4                 108906 non-null  object
 14  country_id                 108617 non-null  object
 15  data_lnk                   108906 non-null  object
 16  distance_to_primary        108906 non-null  object
 17  distance_to_secondary      108906 non-null  object
 18  distance_to_tertiary       108906 non-null  object
 19  distance_to_city           108906 non-null  object
 20  distance_to_town           108906 non-null  object
 21  usage_cap                  108869 non-null  object
 22  is_urban                   108906 non-null  object
 23  days_since_report          108906 non-null  object
 24  staleness_score            108906 non-null  object
 25  is_latest                  108906 non-null  bool  
 26  location_id                108906 non-null  object
 27  cluster_size               108906 non-null  object
 28  clean_country_id           108906 non-null  object
 29  country_name               108617 non-null  object
 30  water_tech                 14322 non-null   object
 31  status                     94815 non-null   object
 32  adm2                       101633 non-null  object
 33  adm3                       11408 non-null   object
 34  adm1                       101306 non-null  object
 35  new_georeferenced_column_  108906 non-null  object
 36  lat_lon_deg                108906 non-null  object
 37  public_data_source         108906 non-null  object
 38  count                      108906 non-null  object
 39  created_timestamp          108906 non-null  object
 40  updated                    108906 non-null  object
 41  water_source_clean         91467 non-null   object
 42  water_source_category      91467 non-null   object
 43  wpdx_id                    107814 non-null  object
 44  rehab_priority             18220 non-null   object
 45  served_population          89341 non-null   object
 46  local_population           89341 non-null   object
 47  crucialness                79618 non-null   object
 48  pressure                   79618 non-null   object
 49  install_year               89181 non-null   object
 50  management_clean           92281 non-null   object
 51  status_clean               89408 non-null   object
 52  activity_id                90774 non-null   object
 53  water_source               104056 non-null  object
 54  management                 92417 non-null   object
 55  converted                  88237 non-null   object
 56  pay                        37268 non-null   object
 57  subjective_quality         4406 non-null    object
 58  scheme_id                  11302 non-null   object
 59  notes                      11075 non-null   object
 60  photo_lnk                  5090 non-null    object
 61  lat_deg_original           1556 non-null    object
 62  lon_deg_original           1556 non-null    object
 63  fecal_coliform_presence    1310 non-null    object
 64  installer                  4638 non-null    object
 65  orig_lnk                   1364 non-null    object
 66  fecal_coliform_value       213 non-null     object
dtypes: bool(1), object(66)
memory usage: 54.9+ MB

We see that all imported columns are of object type, and that we'll have to fix that next.

We exclude columns which have already been cleaned by the WPD team and stored in another "clean" column. We also exclude columns with links and urls.

uganda_water_df_clean = pd.DataFrame(uganda_water_df.drop(columns=['clean_country_id', 'activity_id', 'adm1', 'adm2', 'adm3', 'country_id',
                                                                   'country_name', 'status', 'water_source', 'management',
                                                                   'water_tech', 'orig_lnk', 'photo_lnk', 'data_lnk',
                                                                   'public_data_source', 'converted', 'created_timestamp',
                                                                   'days_since_report', 'updated']))

# check current columns
uganda_water_df_clean.columns
Index(['row_id', 'source', 'lat_deg', 'lon_deg', 'report_date', 'status_id',
       'water_tech_clean', '_water_tech_category', 'facility_type',
       'clean_country_name', 'clean_adm1', 'clean_adm2', 'clean_adm3',
       'clean_adm4', 'distance_to_primary', 'distance_to_secondary',
       'distance_to_tertiary', 'distance_to_city', 'distance_to_town',
       'usage_cap', 'is_urban', 'staleness_score', 'is_latest', 'location_id',
       'cluster_size', 'new_georeferenced_column_', 'lat_lon_deg', 'count',
       'water_source_clean', 'water_source_category', 'wpdx_id',
       'rehab_priority', 'served_population', 'local_population',
       'crucialness', 'pressure', 'install_year', 'management_clean',
       'status_clean', 'pay', 'subjective_quality', 'scheme_id', 'notes',
       'lat_deg_original', 'lon_deg_original', 'fecal_coliform_presence',
       'installer', 'fecal_coliform_value'],
      dtype='object')
print('fecal_coliform_presence distribution:',
      uganda_water_df_clean['fecal_coliform_presence'].value_counts())
print('is_latest distribution:',
      uganda_water_df_clean['is_latest'].value_counts())
print('is_urban distribution:',
      uganda_water_df_clean['is_urban'].value_counts())
print('status_id distribution:',
      uganda_water_df_clean['status_id'].value_counts())
fecal_coliform_presence distribution: Present    1035
Absent      275
Name: fecal_coliform_presence, dtype: int64
is_latest distribution: True     96578
False    12328
Name: is_latest, dtype: int64
is_urban distribution: False    106175
True       2731
Name: is_urban, dtype: int64
status_id distribution: Yes        87600
No         18731
Unknown     2575
Name: status_id, dtype: int64

We want to convert yes/no, True/False columns into 1s and 0s, for modelling's sake. We assume that water points that have an unknown status, are not working.

uganda_water_df_clean['fecal_coliform_presence'] = uganda_water_df_clean[
    'fecal_coliform_presence'].map({
        'Present': 1,
        'Absent': 0
    })

uganda_water_df_clean['is_latest'] = uganda_water_df_clean['is_latest'].astype(
    'int64')

uganda_water_df_clean['is_urban'] = uganda_water_df_clean['is_urban'].map({
    True:
    1,
    False:
    0
})

uganda_water_df_clean['status_id'] = uganda_water_df_clean['status_id'].map({
    'Yes':
    1,
    'No':
    0,
    'Unknown':
    0
})

# check
uganda_water_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108906 entries, 0 to 108905
Data columns (total 48 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   row_id                     108906 non-null  object 
 1   source                     108906 non-null  object 
 2   lat_deg                    108906 non-null  object 
 3   lon_deg                    108906 non-null  object 
 4   report_date                108906 non-null  object 
 5   status_id                  108906 non-null  int64  
 6   water_tech_clean           26226 non-null   object 
 7   _water_tech_category       26226 non-null   object 
 8   facility_type              108906 non-null  object 
 9   clean_country_name         108906 non-null  object 
 10  clean_adm1                 108906 non-null  object 
 11  clean_adm2                 108906 non-null  object 
 12  clean_adm3                 108906 non-null  object 
 13  clean_adm4                 108906 non-null  object 
 14  distance_to_primary        108906 non-null  object 
 15  distance_to_secondary      108906 non-null  object 
 16  distance_to_tertiary       108906 non-null  object 
 17  distance_to_city           108906 non-null  object 
 18  distance_to_town           108906 non-null  object 
 19  usage_cap                  108869 non-null  object 
 20  is_urban                   0 non-null       float64
 21  staleness_score            108906 non-null  object 
 22  is_latest                  108906 non-null  int64  
 23  location_id                108906 non-null  object 
 24  cluster_size               108906 non-null  object 
 25  new_georeferenced_column_  108906 non-null  object 
 26  lat_lon_deg                108906 non-null  object 
 27  count                      108906 non-null  object 
 28  water_source_clean         91467 non-null   object 
 29  water_source_category      91467 non-null   object 
 30  wpdx_id                    107814 non-null  object 
 31  rehab_priority             18220 non-null   object 
 32  served_population          89341 non-null   object 
 33  local_population           89341 non-null   object 
 34  crucialness                79618 non-null   object 
 35  pressure                   79618 non-null   object 
 36  install_year               89181 non-null   object 
 37  management_clean           92281 non-null   object 
 38  status_clean               89408 non-null   object 
 39  pay                        37268 non-null   object 
 40  subjective_quality         4406 non-null    object 
 41  scheme_id                  11302 non-null   object 
 42  notes                      11075 non-null   object 
 43  lat_deg_original           1556 non-null    object 
 44  lon_deg_original           1556 non-null    object 
 45  fecal_coliform_presence    1310 non-null    float64
 46  installer                  4638 non-null    object 
 47  fecal_coliform_value       213 non-null     object 
dtypes: float64(2), int64(2), object(44)
memory usage: 39.9+ MB

CAREFUL, INADVENRTETLY DROPPED is_urban due to mistake. If re-running the analysis, should account for this error.

num_columns = ['distance_to_city', 'distance_to_primary', 'lat_deg', 'lat_deg_original',
               'lon_deg', 'lon_deg_original', 'distance_to_secondary', 'distance_to_tertiary',
               'distance_to_town', 'fecal_coliform_value', 'cluster_size', 'count', 'crucialness',
               'install_year', 'local_population', 'pressure',
               'rehab_priority', 'served_population', 'staleness_score', 'usage_cap']

for col in num_columns:
    float_converter(uganda_water_df_clean, col)

# check
uganda_water_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108906 entries, 0 to 108905
Data columns (total 48 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   row_id                     108906 non-null  object 
 1   source                     108906 non-null  object 
 2   lat_deg                    108906 non-null  float32
 3   lon_deg                    108906 non-null  float32
 4   report_date                108906 non-null  object 
 5   status_id                  108906 non-null  int64  
 6   water_tech_clean           26226 non-null   object 
 7   _water_tech_category       26226 non-null   object 
 8   facility_type              108906 non-null  object 
 9   clean_country_name         108906 non-null  object 
 10  clean_adm1                 108906 non-null  object 
 11  clean_adm2                 108906 non-null  object 
 12  clean_adm3                 108906 non-null  object 
 13  clean_adm4                 108906 non-null  object 
 14  distance_to_primary        108906 non-null  float32
 15  distance_to_secondary      108906 non-null  float32
 16  distance_to_tertiary       108906 non-null  float32
 17  distance_to_city           108906 non-null  float32
 18  distance_to_town           108906 non-null  float32
 19  usage_cap                  108869 non-null  float32
 20  is_urban                   0 non-null       float64
 21  staleness_score            108906 non-null  float32
 22  is_latest                  108906 non-null  int64  
 23  location_id                108906 non-null  object 
 24  cluster_size               108906 non-null  float32
 25  new_georeferenced_column_  108906 non-null  object 
 26  lat_lon_deg                108906 non-null  object 
 27  count                      108906 non-null  float32
 28  water_source_clean         91467 non-null   object 
 29  water_source_category      91467 non-null   object 
 30  wpdx_id                    107814 non-null  object 
 31  rehab_priority             18220 non-null   float32
 32  served_population          89341 non-null   float32
 33  local_population           89341 non-null   float32
 34  crucialness                79618 non-null   float32
 35  pressure                   79618 non-null   float32
 36  install_year               89181 non-null   float32
 37  management_clean           92281 non-null   object 
 38  status_clean               89408 non-null   object 
 39  pay                        37268 non-null   object 
 40  subjective_quality         4406 non-null    object 
 41  scheme_id                  11302 non-null   object 
 42  notes                      11075 non-null   object 
 43  lat_deg_original           1556 non-null    float32
 44  lon_deg_original           1556 non-null    float32
 45  fecal_coliform_presence    1310 non-null    float64
 46  installer                  4638 non-null    object 
 47  fecal_coliform_value       213 non-null     float32
dtypes: float32(20), float64(2), int64(2), object(24)
memory usage: 31.6+ MB
date_converter(uganda_water_df_clean, 'report_date')
uganda_water_df_clean['install_year']=pd.to_datetime(uganda_water_df_clean['install_year'], format="%Y")


# check
print(uganda_water_df_clean['report_date'].head())
print(uganda_water_df_clean['install_year'].head())
0   2005-09-07
1   2005-09-07
2   2005-09-07
3   2005-09-07
4   2005-09-07
Name: report_date, dtype: datetime64[ns]
0   NaT
1   NaT
2   NaT
3   NaT
4   NaT
Name: install_year, dtype: datetime64[ns]
str_columns = list(uganda_water_df_clean.select_dtypes("object").columns)

# remove special characters from string columns
for col in str_columns:
    uganda_water_df_clean[col].str.replace('[^\w\s]', '')
uganda_water_df_clean.isna().mean() * 100
row_id                         0.000000
source                         0.000000
lat_deg                        0.000000
lon_deg                        0.000000
report_date                    0.000000
status_id                      0.000000
water_tech_clean              75.918682
_water_tech_category          75.918682
facility_type                  0.000000
clean_country_name             0.000000
clean_adm1                     0.000000
clean_adm2                     0.000000
clean_adm3                     0.000000
clean_adm4                     0.000000
distance_to_primary            0.000000
distance_to_secondary          0.000000
distance_to_tertiary           0.000000
distance_to_city               0.000000
distance_to_town               0.000000
usage_cap                      0.033974
is_urban                     100.000000
staleness_score                0.000000
is_latest                      0.000000
location_id                    0.000000
cluster_size                   0.000000
new_georeferenced_column_      0.000000
lat_lon_deg                    0.000000
count                          0.000000
water_source_clean            16.012892
water_source_category         16.012892
wpdx_id                        1.002700
rehab_priority                83.269976
served_population             17.965034
local_population              17.965034
crucialness                   26.892917
pressure                      26.892917
install_year                  18.111950
management_clean              15.265458
status_clean                  17.903513
pay                           65.779663
subjective_quality            95.954309
scheme_id                     89.622243
notes                         89.830680
lat_deg_original              98.571245
lon_deg_original              98.571245
fecal_coliform_presence       98.797128
installer                     95.741281
fecal_coliform_value          99.804418
dtype: float64

We decide to drop columns which have more than 50% of its values that are missing. We deem these columns not be of good enough quality to be analysed. Some of these columns pertain to the presence of fecal coliform, the technology of the water point, the price of the water and other notes on the quality of the water available.

uganda_water_df_clean = uganda_water_df_clean.dropna(
    axis=1, thresh=0.5 * len(uganda_water_df_clean))

# check
uganda_water_df_clean.isna().mean() * 100
row_id                        0.000000
source                        0.000000
lat_deg                       0.000000
lon_deg                       0.000000
report_date                   0.000000
status_id                     0.000000
facility_type                 0.000000
clean_country_name            0.000000
clean_adm1                    0.000000
clean_adm2                    0.000000
clean_adm3                    0.000000
clean_adm4                    0.000000
distance_to_primary           0.000000
distance_to_secondary         0.000000
distance_to_tertiary          0.000000
distance_to_city              0.000000
distance_to_town              0.000000
usage_cap                     0.033974
staleness_score               0.000000
is_latest                     0.000000
location_id                   0.000000
cluster_size                  0.000000
new_georeferenced_column_     0.000000
lat_lon_deg                   0.000000
count                         0.000000
water_source_clean           16.012892
water_source_category        16.012892
wpdx_id                       1.002700
served_population            17.965034
local_population             17.965034
crucialness                  26.892917
pressure                     26.892917
install_year                 18.111950
management_clean             15.265458
status_clean                 17.903513
dtype: float64

The documentation informs us that this dataset is not supposed to have any duplicate columns or rows.

# uganda_water_df_clean.duplicated().sum()
# uganda_water_df_clean.T.duplicated().sum()

We impute the missing values in the remaining columns with the median for their administrative region. We first choose the lowest level, clean_adm4, then if there are remaining missing values we choose the second lowest clean_adm3 and so on. We choose the median so that it is not too sensitive to outliers.

uganda_water_df_clean_numerical = uganda_water_df_clean[
    uganda_water_df_clean.select_dtypes(exclude='object').columns]

# check
uganda_water_df_clean_numerical.head()
lat_deg lon_deg report_date status_id distance_to_primary distance_to_secondary distance_to_tertiary distance_to_city distance_to_town usage_cap staleness_score is_latest cluster_size count served_population local_population crucialness pressure install_year
0 0.158537 30.490643 2005-09-07 1 37666.023438 4875.484375 23.399448 45104.835938 32223.873047 250.0 14.327719 0 76.0 1.0 NaN NaN NaN NaN NaT
1 0.070597 30.415651 2005-09-07 1 35843.273438 2602.279297 971.238464 38419.558594 24194.054688 250.0 14.327719 0 8.0 1.0 NaN NaN NaN NaN NaT
2 0.158667 30.490551 2005-09-07 1 37651.726562 4860.169922 34.916096 45094.210938 32236.693359 250.0 14.327719 0 76.0 1.0 NaN NaN NaN NaN NaT
3 0.430480 30.751951 2005-09-07 0 14592.108398 16548.333984 1269.275391 58369.609375 23365.746094 250.0 14.327719 1 1.0 1.0 1.0 25.0 0.04 0.004 NaT
4 -0.134680 30.351593 2005-09-07 1 29073.130859 6369.317871 3270.906982 45044.160156 16042.707031 250.0 14.327719 0 2.0 1.0 NaN NaN NaN NaN NaT
null_cols = uganda_water_df_clean_numerical.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index
cols_to_impute = cols_to_impute.drop('install_year')

# check
cols_to_impute
Index(['usage_cap', 'served_population', 'local_population', 'crucialness',
       'pressure'],
      dtype='object')
for col in cols_to_impute:
    plt.figure()
    sns.distplot(uganda_water_df_clean[col])
    plt.axvline(uganda_water_df_clean[col].mean(), c='r', label='mean')
    plt.axvline(uganda_water_df_clean[col].median(), c='gold', label='median')
    plt.legend()
    plt.title(f'Density plot of {col}')
    plt.show()

We can confirm that the median is the better choice as it represents our data better.

for col in cols_to_impute:
    missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm4')

# check
uganda_water_df_clean.isna().sum()
row_id                           0
source                           0
lat_deg                          0
lon_deg                          0
report_date                      0
status_id                        0
facility_type                    0
clean_country_name               0
clean_adm1                       0
clean_adm2                       0
clean_adm3                       0
clean_adm4                       0
distance_to_primary              0
distance_to_secondary            0
distance_to_tertiary             0
distance_to_city                 0
distance_to_town                 0
usage_cap                        0
staleness_score                  0
is_latest                        0
location_id                      0
cluster_size                     0
new_georeferenced_column_        0
lat_lon_deg                      0
count                            0
water_source_clean           17439
water_source_category        17439
wpdx_id                       1092
served_population                8
local_population                 8
crucialness                    113
pressure                       113
install_year                 19725
management_clean             16625
status_clean                 19498
dtype: int64
for col in cols_to_impute:
    missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm3')

# check
uganda_water_df_clean.isna().sum()
row_id                           0
source                           0
lat_deg                          0
lon_deg                          0
report_date                      0
status_id                        0
facility_type                    0
clean_country_name               0
clean_adm1                       0
clean_adm2                       0
clean_adm3                       0
clean_adm4                       0
distance_to_primary              0
distance_to_secondary            0
distance_to_tertiary             0
distance_to_city                 0
distance_to_town                 0
usage_cap                        0
staleness_score                  0
is_latest                        0
location_id                      0
cluster_size                     0
new_georeferenced_column_        0
lat_lon_deg                      0
count                            0
water_source_clean           17439
water_source_category        17439
wpdx_id                       1092
served_population                0
local_population                 0
crucialness                      0
pressure                         0
install_year                 19725
management_clean             16625
status_clean                 19498
dtype: int64
for col in cols_to_impute:
    missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm2')

# check
uganda_water_df_clean.isna().sum()
row_id                           0
source                           0
lat_deg                          0
lon_deg                          0
report_date                      0
status_id                        0
facility_type                    0
clean_country_name               0
clean_adm1                       0
clean_adm2                       0
clean_adm3                       0
clean_adm4                       0
distance_to_primary              0
distance_to_secondary            0
distance_to_tertiary             0
distance_to_city                 0
distance_to_town                 0
usage_cap                        0
staleness_score                  0
is_latest                        0
location_id                      0
cluster_size                     0
new_georeferenced_column_        0
lat_lon_deg                      0
count                            0
water_source_clean           17439
water_source_category        17439
wpdx_id                       1092
served_population                0
local_population                 0
crucialness                      0
pressure                         0
install_year                 19725
management_clean             16625
status_clean                 19498
dtype: int64
for col in cols_to_impute:
    missing_to_regional_median(uganda_water_df_clean, col, 'clean_adm1')

# check
uganda_water_df_clean.isna().sum()
row_id                           0
source                           0
lat_deg                          0
lon_deg                          0
report_date                      0
status_id                        0
facility_type                    0
clean_country_name               0
clean_adm1                       0
clean_adm2                       0
clean_adm3                       0
clean_adm4                       0
distance_to_primary              0
distance_to_secondary            0
distance_to_tertiary             0
distance_to_city                 0
distance_to_town                 0
usage_cap                        0
staleness_score                  0
is_latest                        0
location_id                      0
cluster_size                     0
new_georeferenced_column_        0
lat_lon_deg                      0
count                            0
water_source_clean           17439
water_source_category        17439
wpdx_id                       1092
served_population                0
local_population                 0
crucialness                      0
pressure                         0
install_year                 19725
management_clean             16625
status_clean                 19498
dtype: int64
for col in cols_to_impute:
    uganda_water_df_clean[col] = uganda_water_df_clean[col].fillna(
        uganda_water_df_clean[col].median())

# check remaining null values
uganda_water_df_clean.isna().sum()
row_id                           0
source                           0
lat_deg                          0
lon_deg                          0
report_date                      0
status_id                        0
facility_type                    0
clean_country_name               0
clean_adm1                       0
clean_adm2                       0
clean_adm3                       0
clean_adm4                       0
distance_to_primary              0
distance_to_secondary            0
distance_to_tertiary             0
distance_to_city                 0
distance_to_town                 0
usage_cap                        0
staleness_score                  0
is_latest                        0
location_id                      0
cluster_size                     0
new_georeferenced_column_        0
lat_lon_deg                      0
count                            0
water_source_clean           17439
water_source_category        17439
wpdx_id                       1092
served_population                0
local_population                 0
crucialness                      0
pressure                         0
install_year                 19725
management_clean             16625
status_clean                 19498
dtype: int64

We will not fill the non-numeric columns that still have missing values as the proportion of missing values is too great (15-25%). We will decide what to do with them later in our analysis. We must do this with great care as omitting or filling these columns/rows will have a large impact on the dataset.

uganda_water_df_clean.to_csv(data_filepath + 'ta_1_uganda_water_df_clean.csv')
Image(dictionary_filepath+"1A-Water-Dictionary.png")
# #constructing columns of interest
# median_columns_adm1=list(cols_to_impute )
# median_columns_adm1.append('clean_adm1')

# #group by adm2, with median
# median_adm1=uganda_water_df_clean[median_columns_adm1].groupby('clean_adm1').median()
# median_adm1.reset_index(inplace=True)

# #merge median dataset with original dataset
# uganda_water_df_clean=pd.merge(uganda_water_df_clean, median_adm1, how='left',\
#                                left_on=uganda_water_df_clean['clean_adm1'],\
#                                right_on=median_adm1['clean_adm1'], suffixes=(None, '_median_adm1'))

# #fill null values with the median value of its region (adm2)
# for col in cols_to_impute:
#     uganda_water_df_clean[col] = uganda_water_df_clean[col].fillna(uganda_water_df_clean\
#                                                                    [f'{col}_median_adm1'])
# #removing key column resulting from the merge
# del uganda_water_df_clean[uganda_water_df_clean.columns[0]]
# uganda_water_df_clean.head()

# #check remaining null values
# uganda_water_df_clean.isna().sum()