4. Master dataframe
Creating the master dataframe by combining water point, demographic and conflict data
We merge water point data with conflict and demographic data at the sub-county level. We end up with more than 100k water points, each with information on whether it is functioning or not, as well as all the regional information described above.
%run /Users/thomasadler/Desktop/futuristic-platipus/capstone/notebooks/ta_01_packages_functions.py
water_df=pd.read_csv(data_filepath + 'ta_1_uganda_water_df_clean.csv', index_col=0)
#check
water_df.info()
conflict_df=pd.read_csv(data_filepath +'ta_3_conflict_df_clean.csv', index_col=0)
#check
conflict_df.info()
demographic_df=pd.read_csv(data_filepath +'ta_2_subcounty_demographic_clean.csv', index_col=0)
#check
demographic_df.info()
#merge water points with sub-county demographics on the cleaned sub-county name
master_df=pd.merge(water_df, demographic_df, how='left',\
                   left_on=water_df['clean_adm4'],\
                   right_on=demographic_df['clean_adm4'], suffixes=['x', 'y'])
#check
master_df.info()
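Note that passing Series (rather than column names) to left_on/right_on makes pandas materialise the join key as an extra key_0 column in the result, which lands at the front of the frame; this is why we delete the first column before each subsequent merge below. A minimal toy illustration, with hypothetical frames:
#passing Series to left_on/right_on adds the join key back as a 'key_0' column
left = pd.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
right = pd.DataFrame({'k': ['a', 'b'], 'w': [3, 4]})
merged = pd.merge(left, right, left_on=left['k'], right_on=right['k'])
print(merged.columns.tolist())  #includes 'key_0' alongside 'k_x', 'k_y', 'v', 'w'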
We then want to know how violent and unstable a region is. We sum the total number of fatalities and events/conflicts each region has had since 1997 and merge these totals with our water point and demographic data.
conflict_fatal_grouped=conflict_df[['clean_adm4', 'fatalities']].groupby('clean_adm4').sum()
conflict_fatal_grouped.columns=['total_fatalities_adm4']
conflict_fatal_grouped.reset_index(inplace=True)
#check
conflict_fatal_grouped
conflict_events_grouped=conflict_df[['clean_adm4', 'data_id']].groupby(['clean_adm4']).count()
conflict_events_grouped.columns=['total_events_adm4']
conflict_events_grouped.reset_index(inplace=True)
#check
conflict_events_grouped
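As a design note, the two groupbys above could be computed in a single pass with pandas named aggregation; an equivalent sketch (conflict_grouped is a hypothetical name):
#equivalent single-pass aggregation of fatalities and event counts
conflict_grouped = conflict_df.groupby('clean_adm4').agg(
    total_fatalities_adm4=('fatalities', 'sum'),
    total_events_adm4=('data_id', 'count')).reset_index()
We keep the two separate frames here because each is merged into the master dataframe on its own below.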
#drop the key_0 column left over from the previous merge
del master_df[master_df.columns[0]]
#merge water dataset with fatalities dataset
master_df=pd.merge(master_df, conflict_fatal_grouped, how='left',\
                   left_on=master_df['clean_adm4x'],\
                   right_on=conflict_fatal_grouped['clean_adm4'], suffixes=('z', 'w'))
#check
master_df.head()
#drop the key_0 column left over from the previous merge
del master_df[master_df.columns[0]]
#merge water dataset with events dataset
master_df=pd.merge(master_df, conflict_events_grouped, how='left',\
                   left_on=master_df['clean_adm4x'],\
                   right_on=conflict_events_grouped['clean_adm4'], suffixes=('a', 'b'))
#check
master_df.head()
master_df.columns
We keep the columns that may help us predict whether a water point is functioning or not.
master_df_clean = master_df[['wpdx_id',
'lat_deg', 'lon_deg', 'status_id',
'clean_adm1', 'clean_adm2', 'clean_adm3', 'clean_adm4x', 'distance_to_primary',
'distance_to_secondary', 'distance_to_tertiary', 'distance_to_city',
'distance_to_town', 'usage_cap', 'staleness_score', 'cluster_size', 'water_source_clean','install_year',
'management_clean', 'served_population', 'local_population', 'crucialness', 'pressure',
'perc_hh_head_male', 'perc_pop612_primary', 'perc_pop1318_secondary',
'perc_pop18p_illiterate', 'perc_pop017_certificate',
'perc_pop017_both_parents', 'perc_pop2p_disability', 'perc_pop1017_married',
'perc_pop1217_birth', 'perc_pop1464_working', 'perc_pop10p_mobile_phone',
'perc_hh_temp_dwelling', 'perc_hh_mosquito_net', 'perc_hh_piped_water',
'perc_hh_borehole', 'perc_hh_toilet', 'perc_hh_own_house',
'perc_hh_own_tv', 'perc_hh_bank_acc', 'perc_hh_subs_farm',
'perc_hh_less2meals', 'perc_hh_electricity', 'total_fatalities_adm4',
'total_events_adm4'
]]
#check
master_df_clean.info()
print(f'There are {master_df_clean.isna().sum().sum()} null values')
master_df_clean['clean_adm1'].value_counts()
master_df_clean[['distance_to_primary', 'distance_to_secondary', 'distance_to_tertiary', 'distance_to_city']].describe().T
The distances are evidently in metres, as we know that the area of Uganda is 241,000 km². We therefore transform these columns into kilometres to make our analysis and interpretation easier.
distances=['distance_to_primary', 'distance_to_secondary', 'distance_to_tertiary', 'distance_to_city', 'distance_to_town']
#transform into kilometres
master_df_clean[distances]=master_df_clean[distances]/1000
#check
master_df_clean[['distance_to_primary', 'distance_to_secondary', 'distance_to_tertiary', 'distance_to_city']].describe().T
master_df_clean_numerical = master_df_clean.select_dtypes(exclude='object')
# check
master_df_clean_numerical.head()
null_cols = master_df_clean_numerical.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index
cols_to_impute = cols_to_impute.drop(['total_events_adm4','total_fatalities_adm4'])
# check
cols_to_impute
We have missing values because not all regions in our water point dataset match the official administrative regions; water points whose region is not recognisable end up with missing values. Here, we first fill the null values for the demographic data with the median of the lowest-level region we can attribute them to. We start with adm4.
for col in cols_to_impute:
    plt.figure()
    #histplot with kde replaces the deprecated sns.distplot
    sns.histplot(master_df_clean_numerical[col], kde=True, stat='density')
    plt.axvline(master_df_clean_numerical[col].mean(), c='r', label='mean')
    plt.axvline(master_df_clean_numerical[col].median(), c='gold', label='median')
    plt.legend()
    plt.title(f'Density plot of {col}')
    plt.show()
These graphs show that the median is a better choice than the mean, as it is more representative of our dataset: the mean is more sensitive to outliers, so more observations lie close to the median value than to the average value.
for col in cols_to_impute:
    missing_to_regional_median(master_df_clean, col, 'clean_adm4x')
#checking remaining null values
master_df_clean.isna().sum().sum()
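missing_to_regional_median is defined in the %run helper script and is not shown here; a minimal sketch of what we assume it does (fill each null with the column's median computed within the given region, in place):
def missing_to_regional_median(df, col, region_col):
    #assumed behaviour: replace nulls in col with the median of col
    #within each region_col group
    df[col] = df[col].fillna(df.groupby(region_col)[col].transform('median'))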
Since we still have missing values, we fill them with the median of the next-lowest administrative level, adm3. We continue until we do not have any more missing values.
for col in cols_to_impute:
    missing_to_regional_median(master_df_clean, col, 'clean_adm3')
#checking remaining null values
master_df_clean.isna().sum().sum()
for col in cols_to_impute:
    missing_to_regional_median(master_df_clean, col, 'clean_adm2')
#checking remaining null values
master_df_clean.isna().sum().sum()
for col in cols_to_impute:
    missing_to_regional_median(master_df_clean, col, 'clean_adm1')
#checking remaining null values
master_df_clean.isna().sum().sum()
We end up imputing the country's median for the few remaining missing values.
for col in cols_to_impute:
    master_df_clean[col] = master_df_clean[col].fillna(
        master_df_clean[col].median())
# check remaining null values
master_df_clean.isna().sum().sum()
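Taken together, the adm4 → adm3 → adm2 → adm1 → country cascade above could be written as one loop; a compact equivalent, assuming the helper's in-place behaviour sketched earlier:
#equivalent imputation cascade in one loop (sketch)
for level in ['clean_adm4x', 'clean_adm3', 'clean_adm2', 'clean_adm1']:
    for col in cols_to_impute:
        missing_to_regional_median(master_df_clean, col, level)
#any remaining nulls get the country-wide median
for col in cols_to_impute:
    master_df_clean[col] = master_df_clean[col].fillna(master_df_clean[col].median())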
Missing values for fatalities/events essentially mean there were no recorded conflicts or fatalities in that region since 1997. We assume that the conflict dataset we accessed is comprehensive and includes all conflicts since 1997.
master_df_clean.isna().sum()
for col in ['total_fatalities_adm4', 'total_events_adm4']:
    master_df_clean[col] = master_df_clean[col].fillna(0)
# check remaining null values
master_df_clean.isna().sum().sum()
date_converter(master_df_clean, 'install_year')
#converting to year
master_df_clean['install_year']=master_df_clean['install_year'].dt.year
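date_converter also comes from the %run helper script; presumably something like the following sketch, which parses the column to datetime and coerces unparseable entries to NaT:
def date_converter(df, col):
    #assumed behaviour: parse col to datetime in place, coercing bad values to NaT
    df[col] = pd.to_datetime(df[col], errors='coerce')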
#visualising distribution of installation year
plt.figure()
sns.histplot(master_df_clean['install_year'])
plt.axvline(master_df_clean['install_year'].mean(), c='r', label='mean')
plt.axvline(master_df_clean['install_year'].median(), c='gold', label='median')
plt.legend()
plt.title('Distribution of installation year')
plt.show()
We choose to create a binary column for the installation year. All water points constructed in or after 2006 are encoded with a 1, and all before are encoded 0. We choose 2006 because it was Uganda's first multi-party election in 25 years, which we consider an important enough turning point to differentiate between water points. As the distribution above shows, we have a healthy balance of constructions before and after 2006. In addition, water points with a missing installation year are encoded with 0: we assume that if the installation year is unknown, the water point is probably old, dating from when bookkeeping and official records were not as good. We would expect water points constructed after 2006 to be of better quality, due to a more democratic government and better technology.
Ideally we would look at how our binarisation impacts the correlation with our outcome variable. However, because of missing values, install_year and status_id are of different lengths, making that analysis impossible.
master_df_clean['install_year'] = np.where(
    master_df_clean['install_year'] >= 2006, 1, 0)
#unknown years are already 0: NaN >= 2006 evaluates to False, so np.where encodes
#missing years as 0, making this fillna a redundant safeguard
master_df_clean['install_year'] = master_df_clean['install_year'].fillna(0)
#check
master_df_clean['install_year'].value_counts()
We consider entities from the government and public institutions to be a form of "public" management. We also include community management, as we assume communities are in some way related to local governments or governance structures and have similar goals and incentives to the more formal government institutions. Religious, health and medical facilities follow the same reasoning. It will be interesting to see whether public or private management structures are more likely to keep a water point functioning: although their goals are different, their incentives might still make them want to keep a water point working.
public_management=['Other Institutional Management', 'Direct Government Operation', 'Community Management',
'School Management','Religious Institution','Health Care Facility' ]
#encode management as binary: public (1) or not public (0)
master_df_clean['management_clean'] = np.where(
    master_df_clean['management_clean'].isin(public_management), 1, 0)
#check
master_df_clean['management_clean'].value_counts()
master_df_clean['water_source_clean'].value_counts()
We create a binary column identifying whether the water point technology is complex or not, following the analysis by Ravi & Rogger (2021).
complex_tech=['Borehole', 'Piped Water', 'Sand or Sub-surface Dam', 'Packaged Water', 'Delivered Water']
#encode technology as binary: complex (1) or not complex (0)
master_df_clean['water_source_clean'] = np.where(
    master_df_clean['water_source_clean'].isin(complex_tech), 1, 0)
#check
master_df_clean['water_source_clean'].value_counts()
master_df_clean.isna().sum().sum()
master_df_clean.columns
master_df_clean.rename(columns={'status_id': 'is_functioning', 'install_year': 'is_installed_after_2006',
                                'water_source_clean': 'is_complex_tech', 'management_clean': 'is_public_management',
                                'clean_adm4x': 'clean_adm4'}, inplace=True)
We want to calculate what proportion of the local population is being served by the water point.
master_df_clean['perc_local_served']=(master_df_clean['served_population']/master_df_clean['local_population'])*100
#check
master_df_clean[['served_population', 'local_population', 'perc_local_served']].head()
master_df_clean['perc_local_served'].isna().sum()
These nulls arise from 0/0 divisions: the water point is essentially not serving any of the local population. We can safely fill those null values with 0.
master_df_clean['perc_local_served']=master_df_clean['perc_local_served'].fillna(0)
#check
master_df_clean['perc_local_served'].isna().sum()
master_df_clean.info()
print(f"There are {master_df_clean.duplicated().sum()} duplicated rows.")
master_df_clean_final=master_df_clean.drop_duplicates()
print(f"There are now {master_df_clean_final.duplicated().sum()} duplicated rows.")
Our dataset is now ready for EDA, feature selection and modelling. We have no missing values, no duplicate rows, and every column is interpretable and can be modelled.
master_df_clean_final.to_csv(data_filepath + 'ta_4_master_df.csv')
Image(dictionary_filepath+"4A-Master-Dictionary.png")
Image(dictionary_filepath+"4B-Master-Dictionary.png")