We have manually downloaded regional demographic data for Uganda on our computer and will load it in this notebook to do some initial cleaning.

We will clean and prepare regional demographic data for further analysis.

%run /Users/thomasadler/Desktop/futuristic-platipus/notebooks/ta_01_packages_functions.py

We have manually chosen the relevant columns from the Ugandan Statistics datasets for multiple reasons:

  • every region (4) has its own dataset, with inconsistent naming of columns and sheets
  • there are multiple sheets within each excel workbook, each with an inconsistent number of indexes
  • manually choosing an initial set of relevant variables is easier when visualizing the full names in an excel workbook and comparing those with other columns
  • the number of data points is manageable in excel and does not run the risk of the data being cut
  • minimal data processing has been carried out to avoid any mistakes
regions_df = ['eastern', 'western', 'central', 'northern']
eastern_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[0]}_demographic.xlsx'))
western_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[1]}_demographic.xlsx'))
central_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[2]}_demographic.xlsx'))
northern_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[3]}_demographic.xlsx'))
regions_df = [eastern_df, western_df, central_df, northern_df]
demographic = pd.concat(regions_df, axis=0)

# check
demographic.head()
subcounty parish tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability ... perc_hh_mosquito_net perc_hh_piped_water perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity
0 Eastern Division Naluwerere 5878 74.8 91.5 49.9 29.0 31.4 92.4 8.9 ... 83.5 20.1 59.4 5.0 55.8 12.8 23.1 38.4 6.3 19.5
1 Eastern Division Nkusi 7946 74.3 88.7 58.2 14.9 41.9 93.6 7.8 ... 86.9 34.3 10.6 1.1 25.9 33.3 44.3 12.1 8.3 44.9
2 Western Division Bwole 8370 74.5 90.9 56.8 14.1 41.9 92.8 9.7 ... 89.7 29.7 26.9 1.2 38.4 27.8 43.8 23.7 8.4 37.4
3 Western Division Ndifakulya 6553 73.4 87.6 55.0 16.2 46.6 93.7 10.3 ... 87.5 22.8 22.6 1.8 37.1 27.1 37.4 28.6 6.7 35.1
4 Budhaya Budhaya 5796 84.3 88.8 23.4 45.2 27.6 95.1 9.1 ... 80.1 0.2 96.8 13.1 88.9 1.1 9.2 81.7 5.9 6.1

5 rows × 29 columns

total_demographic = demographic.copy()
total_demographic.head()
subcounty parish tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability ... perc_hh_mosquito_net perc_hh_piped_water perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity
0 Eastern Division Naluwerere 5878 74.8 91.5 49.9 29.0 31.4 92.4 8.9 ... 83.5 20.1 59.4 5.0 55.8 12.8 23.1 38.4 6.3 19.5
1 Eastern Division Nkusi 7946 74.3 88.7 58.2 14.9 41.9 93.6 7.8 ... 86.9 34.3 10.6 1.1 25.9 33.3 44.3 12.1 8.3 44.9
2 Western Division Bwole 8370 74.5 90.9 56.8 14.1 41.9 92.8 9.7 ... 89.7 29.7 26.9 1.2 38.4 27.8 43.8 23.7 8.4 37.4
3 Western Division Ndifakulya 6553 73.4 87.6 55.0 16.2 46.6 93.7 10.3 ... 87.5 22.8 22.6 1.8 37.1 27.1 37.4 28.6 6.7 35.1
4 Budhaya Budhaya 5796 84.3 88.8 23.4 45.2 27.6 95.1 9.1 ... 80.1 0.2 96.8 13.1 88.9 1.1 9.2 81.7 5.9 6.1

5 rows × 29 columns

total_demographic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7557 entries, 0 to 1650
Data columns (total 29 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    7557 non-null   object 
 1   parish                       7557 non-null   object 
 2   tot_pop_parish               7557 non-null   int64  
 3   perc_hh_head_male            7557 non-null   float64
 4   perc_pop612_primary          7557 non-null   float64
 5   perc_pop1318_secondary       7557 non-null   float64
 6   perc_pop18p_illiterate       7557 non-null   float64
 7   perc_pop017_certificate      7557 non-null   float64
 8   perc_pop017_both_parents     7557 non-null   float64
 9   perc_pop2p_disability        7557 non-null   float64
 10  perc_pop1017_married         7557 non-null   float64
 11  perc_pop1217_birth           7557 non-null   float64
 12  perc_pop1464_working         7557 non-null   float64
 13  perc_pop10p_mobile_phone     7557 non-null   float64
 14  perc_hh_temp_dwelling        7502 non-null   float64
 15  perc_pop_5km_dist_primary    7556 non-null   float64
 16  perc_pop_5km_dist_secondary  7557 non-null   float64
 17  perc_pop_5km_dist_health     7557 non-null   float64
 18  perc_pop_5km_dist_police     7557 non-null   float64
 19  perc_hh_mosquito_net         7557 non-null   float64
 20  perc_hh_piped_water          7555 non-null   float64
 21  perc_hh_borehole             7552 non-null   float64
 22  perc_hh_toilet               7540 non-null   float64
 23  perc_hh_own_house            7557 non-null   float64
 24  perc_hh_own_tv               7557 non-null   float64
 25  perc_hh_bank_acc             7557 non-null   float64
 26  perc_hh_subs_farm            7557 non-null   float64
 27  perc_hh_less2meals           7557 non-null   float64
 28  perc_hh_electricity          7557 non-null   float64
dtypes: float64(26), int64(1), object(2)
memory usage: 1.7+ MB
total_demographic.isna().sum()
subcounty                       0
parish                          0
tot_pop_parish                  0
perc_hh_head_male               0
perc_pop612_primary             0
perc_pop1318_secondary          0
perc_pop18p_illiterate          0
perc_pop017_certificate         0
perc_pop017_both_parents        0
perc_pop2p_disability           0
perc_pop1017_married            0
perc_pop1217_birth              0
perc_pop1464_working            0
perc_pop10p_mobile_phone        0
perc_hh_temp_dwelling          55
perc_pop_5km_dist_primary       1
perc_pop_5km_dist_secondary     0
perc_pop_5km_dist_health        0
perc_pop_5km_dist_police        0
perc_hh_mosquito_net            0
perc_hh_piped_water             2
perc_hh_borehole                5
perc_hh_toilet                 17
perc_hh_own_house               0
perc_hh_own_tv                  0
perc_hh_bank_acc                0
perc_hh_subs_farm               0
perc_hh_less2meals              0
perc_hh_electricity             0
dtype: int64

We fill missing values of a parish by its subcounty average. Due to the low number of missing values, we assume that the average is a good representation of the subcounty's situation.

null_cols = total_demographic.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index

# check
cols_to_impute
Index(['perc_hh_temp_dwelling', 'perc_pop_5km_dist_primary',
       'perc_hh_piped_water', 'perc_hh_borehole', 'perc_hh_toilet'],
      dtype='object')
for col in cols_to_impute:
    missing_to_regional_median(total_demographic, col, 'subcounty')

# check
total_demographic.isna().sum()
subcounty                      0
parish                         0
tot_pop_parish                 0
perc_hh_head_male              0
perc_pop612_primary            0
perc_pop1318_secondary         0
perc_pop18p_illiterate         0
perc_pop017_certificate        0
perc_pop017_both_parents       0
perc_pop2p_disability          0
perc_pop1017_married           0
perc_pop1217_birth             0
perc_pop1464_working           0
perc_pop10p_mobile_phone       0
perc_hh_temp_dwelling          7
perc_pop_5km_dist_primary      0
perc_pop_5km_dist_secondary    0
perc_pop_5km_dist_health       0
perc_pop_5km_dist_police       0
perc_hh_mosquito_net           0
perc_hh_piped_water            0
perc_hh_borehole               0
perc_hh_toilet                 2
perc_hh_own_house              0
perc_hh_own_tv                 0
perc_hh_bank_acc               0
perc_hh_subs_farm              0
perc_hh_less2meals             0
perc_hh_electricity            0
dtype: int64
null_cols2 = total_demographic.isna().sum()
cols_to_impute2 = null_cols[null_cols > 0].index

# check
cols_to_impute2
Index(['perc_hh_temp_dwelling', 'perc_pop_5km_dist_primary',
       'perc_hh_piped_water', 'perc_hh_borehole', 'perc_hh_toilet'],
      dtype='object')

For those that still have missing values, due to the fact that the whole subcounty does not have values, we fill the missing values with the country's average.

for col in cols_to_impute2:
    total_demographic[col] = total_demographic[col].fillna(
        total_demographic[col].mean())

# check no more null values
total_demographic.isna().sum().sum() == 0
True
total_demographic.duplicated().sum() == 0
True
total_demographic.T.duplicated().sum() == 0
True

We have information for each parish, however we need data for each subcounty. We will use population as a proxy to calculate the weight of each parish in a subcounty. We then calculate a weighted average of parishes to get subcounty values.

tot_pop_subcounty = total_demographic[[
    'subcounty', 'tot_pop_parish']].groupby('subcounty').sum()

# rename column
tot_pop_subcounty.columns = ['tot_pop_subcounty']

# check
tot_pop_subcounty
tot_pop_subcounty
subcounty
Abako 24329
Abanga 17613
Abarilela 24290
Aber 33366
Abia 23436
... ...
Yumbe Town Council 34806
Zesui 10782
Zeu 35354
Zirobwe 44992
Zombo Town Council 12091

1382 rows × 1 columns

total_demographic = total_demographic.merge(
    tot_pop_subcounty, how='left', on='subcounty')

# check
total_demographic.head()
subcounty parish tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability ... perc_hh_piped_water perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity tot_pop_subcounty
0 Eastern Division Naluwerere 5878 74.8 91.5 49.9 29.0 31.4 92.4 8.9 ... 20.1 59.4 5.0 55.8 12.8 23.1 38.4 6.3 19.5 146794
1 Eastern Division Nkusi 7946 74.3 88.7 58.2 14.9 41.9 93.6 7.8 ... 34.3 10.6 1.1 25.9 33.3 44.3 12.1 8.3 44.9 146794
2 Western Division Bwole 8370 74.5 90.9 56.8 14.1 41.9 92.8 9.7 ... 29.7 26.9 1.2 38.4 27.8 43.8 23.7 8.4 37.4 169457
3 Western Division Ndifakulya 6553 73.4 87.6 55.0 16.2 46.6 93.7 10.3 ... 22.8 22.6 1.8 37.1 27.1 37.4 28.6 6.7 35.1 169457
4 Budhaya Budhaya 5796 84.3 88.8 23.4 45.2 27.6 95.1 9.1 ... 0.2 96.8 13.1 88.9 1.1 9.2 81.7 5.9 6.1 28472

5 rows × 30 columns

total_demographic['weight'] = total_demographic['tot_pop_parish'] / \
    total_demographic['tot_pop_subcounty']

# check all weights equal to one
total_demographic[['subcounty', 'weight']].groupby('subcounty').sum(
).sum() == len(total_demographic[['subcounty']].groupby('subcounty'))
weight    True
dtype: bool
total_demographic.iloc[:, 2:-1] = total_demographic.iloc[:,
                                                         2:-1].multiply(total_demographic['weight'], axis='index')

# check
total_demographic.head()
subcounty parish tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability ... perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity tot_pop_subcounty weight
0 Eastern Division Naluwerere 235.369865 2.995180 3.663890 1.998121 1.161233 1.257335 3.699928 0.356378 ... 2.378525 0.200213 2.234372 0.512544 0.924982 1.537632 0.252268 0.780829 5878.0 0.040043
1 Eastern Division Nkusi 430.119187 4.021880 4.801356 3.150382 0.806541 2.268059 5.066594 0.422216 ... 0.573781 0.059543 1.401974 1.802538 2.397971 0.654976 0.449281 2.430449 7946.0 0.054130
2 Western Division Bwole 413.419924 3.679783 4.489829 2.805526 0.696442 2.069569 4.583676 0.479113 ... 1.328673 0.059272 1.896694 1.373127 2.163416 1.170616 0.414902 1.847300 8370.0 0.049393
3 Western Division Ndifakulya 253.408292 2.838420 3.387543 2.126882 0.626463 1.802049 3.623433 0.398307 ... 0.873955 0.069607 1.434678 1.047973 1.446280 1.105979 0.259093 1.357337 6553.0 0.038671
4 Budhaya Budhaya 1179.882551 17.160818 18.076876 4.763501 9.201292 5.618488 19.359357 1.852473 ... 19.705423 2.666746 18.097232 0.223925 1.872829 16.631540 1.201054 1.241767 5796.0 0.203568

5 rows × 31 columns

subcounty_demographic = total_demographic.groupby('subcounty').sum()

# check that no column is more than 100 (except for population)
subcounty_demographic.describe()
tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability perc_pop1017_married perc_pop1217_birth ... perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity tot_pop_subcounty weight
count 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 ... 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1382.000000 1.382000e+03
mean 5325.183537 77.089347 80.701635 29.792665 33.362797 29.141930 91.772449 13.642423 6.564745 8.538391 ... 37.962323 11.096495 81.550674 6.662546 17.433289 79.303604 11.223005 12.089956 25061.251809 1.000000e+00
std 4841.601497 5.125613 12.615068 9.577757 13.373853 14.973238 2.422723 5.800347 2.818645 3.914319 ... 32.345151 16.196412 17.098956 10.063932 11.286255 20.531131 9.146918 12.956311 28004.997066 8.450033e-18
min 565.125115 43.296939 6.469903 3.019500 5.159501 1.896132 80.607712 2.400000 1.100000 2.060516 ... 0.000000 0.000000 12.806064 0.197382 1.831688 1.191364 1.683737 0.483304 1981.000000 1.000000e+00
25% 2922.142298 73.772953 78.781441 23.387518 25.688698 17.218885 90.523072 10.025470 4.636206 5.890590 ... 6.476420 2.043361 74.093670 1.523332 9.782145 76.522410 5.850080 4.837981 12767.750000 1.000000e+00
50% 4349.144386 77.310364 83.761165 28.671131 32.654889 27.438227 91.900018 12.706115 6.001666 7.714616 ... 30.362959 5.215642 87.873951 3.117658 13.599117 87.469722 9.005092 7.857371 19268.000000 1.000000e+00
75% 6264.462068 80.759590 86.656814 35.128398 39.105892 38.857003 93.368579 16.113406 7.951506 10.177006 ... 66.655302 12.242853 94.402075 7.030110 22.011153 92.201184 12.783530 13.648965 30635.000000 1.000000e+00
max 81962.015179 89.136705 94.120353 59.603670 95.163615 89.807095 96.771055 70.398504 24.485364 53.852961 ... 99.463761 96.092403 99.200501 69.874708 72.581032 98.034043 70.340232 85.396092 425070.000000 1.000000e+00

8 rows × 29 columns

subcounty_demographic.reset_index(inplace=True)

# check
subcounty_demographic.head()
subcounty tot_pop_parish perc_hh_head_male perc_pop612_primary perc_pop1318_secondary perc_pop18p_illiterate perc_pop017_certificate perc_pop017_both_parents perc_pop2p_disability perc_pop1017_married ... perc_hh_borehole perc_hh_toilet perc_hh_own_house perc_hh_own_tv perc_hh_bank_acc perc_hh_subs_farm perc_hh_less2meals perc_hh_electricity tot_pop_subcounty weight
0 Abako 4980.425131 77.188397 73.638242 23.234560 37.155288 40.631497 89.292223 19.016807 5.219919 ... 27.780579 11.052814 97.636623 0.733992 9.180801 93.104768 11.556529 3.969103 24329.0 1.0
1 Abanga 3657.634645 74.663209 49.826100 8.726509 56.078805 6.268466 91.904968 16.344297 8.435729 ... 4.249634 17.929563 94.744581 0.576716 9.267484 92.562215 4.859519 3.101715 17613.0 1.0
2 Abarilela 4316.878551 71.659304 82.324026 23.733903 43.352421 32.763751 92.361029 16.989201 4.838436 ... 96.878987 32.313685 95.087715 1.206439 11.167011 95.398728 14.410782 13.117472 24290.0 1.0
3 Aber 8844.329617 75.875232 76.815372 22.464650 37.663891 39.250420 91.166310 12.758754 10.031098 ... 72.613025 9.453135 95.480702 1.503926 8.704004 93.712237 6.937463 6.639306 33366.0 1.0
4 Abia 3923.494880 79.001784 74.058850 22.057540 35.890114 28.721663 89.266701 16.557975 9.301314 ... 42.406439 14.932744 98.024441 0.811487 9.335049 94.918928 21.129971 4.331349 23436.0 1.0

5 rows × 30 columns

subcounty_demographic_clean = pd.DataFrame(
    subcounty_demographic.drop(columns=['weight', 'tot_pop_parish']))

# check current columns
subcounty_demographic_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1382 entries, 0 to 1381
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    1382 non-null   object 
 1   perc_hh_head_male            1382 non-null   float64
 2   perc_pop612_primary          1382 non-null   float64
 3   perc_pop1318_secondary       1382 non-null   float64
 4   perc_pop18p_illiterate       1382 non-null   float64
 5   perc_pop017_certificate      1382 non-null   float64
 6   perc_pop017_both_parents     1382 non-null   float64
 7   perc_pop2p_disability        1382 non-null   float64
 8   perc_pop1017_married         1382 non-null   float64
 9   perc_pop1217_birth           1382 non-null   float64
 10  perc_pop1464_working         1382 non-null   float64
 11  perc_pop10p_mobile_phone     1382 non-null   float64
 12  perc_hh_temp_dwelling        1382 non-null   float64
 13  perc_pop_5km_dist_primary    1382 non-null   float64
 14  perc_pop_5km_dist_secondary  1382 non-null   float64
 15  perc_pop_5km_dist_health     1382 non-null   float64
 16  perc_pop_5km_dist_police     1382 non-null   float64
 17  perc_hh_mosquito_net         1382 non-null   float64
 18  perc_hh_piped_water          1382 non-null   float64
 19  perc_hh_borehole             1382 non-null   float64
 20  perc_hh_toilet               1382 non-null   float64
 21  perc_hh_own_house            1382 non-null   float64
 22  perc_hh_own_tv               1382 non-null   float64
 23  perc_hh_bank_acc             1382 non-null   float64
 24  perc_hh_subs_farm            1382 non-null   float64
 25  perc_hh_less2meals           1382 non-null   float64
 26  perc_hh_electricity          1382 non-null   float64
 27  tot_pop_subcounty            1382 non-null   float64
dtypes: float64(27), object(1)
memory usage: 302.4+ KB
subcounty_demographic_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1382 entries, 0 to 1381
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    1382 non-null   object 
 1   perc_hh_head_male            1382 non-null   float64
 2   perc_pop612_primary          1382 non-null   float64
 3   perc_pop1318_secondary       1382 non-null   float64
 4   perc_pop18p_illiterate       1382 non-null   float64
 5   perc_pop017_certificate      1382 non-null   float64
 6   perc_pop017_both_parents     1382 non-null   float64
 7   perc_pop2p_disability        1382 non-null   float64
 8   perc_pop1017_married         1382 non-null   float64
 9   perc_pop1217_birth           1382 non-null   float64
 10  perc_pop1464_working         1382 non-null   float64
 11  perc_pop10p_mobile_phone     1382 non-null   float64
 12  perc_hh_temp_dwelling        1382 non-null   float64
 13  perc_pop_5km_dist_primary    1382 non-null   float64
 14  perc_pop_5km_dist_secondary  1382 non-null   float64
 15  perc_pop_5km_dist_health     1382 non-null   float64
 16  perc_pop_5km_dist_police     1382 non-null   float64
 17  perc_hh_mosquito_net         1382 non-null   float64
 18  perc_hh_piped_water          1382 non-null   float64
 19  perc_hh_borehole             1382 non-null   float64
 20  perc_hh_toilet               1382 non-null   float64
 21  perc_hh_own_house            1382 non-null   float64
 22  perc_hh_own_tv               1382 non-null   float64
 23  perc_hh_bank_acc             1382 non-null   float64
 24  perc_hh_subs_farm            1382 non-null   float64
 25  perc_hh_less2meals           1382 non-null   float64
 26  perc_hh_electricity          1382 non-null   float64
 27  tot_pop_subcounty            1382 non-null   float64
dtypes: float64(27), object(1)
memory usage: 302.4+ KB
print(subcounty_demographic_clean.isna().sum().sum() > 0,
      subcounty_demographic_clean.duplicated().sum() > 0,
      subcounty_demographic_clean.T.duplicated().sum() > 0)
False False False
subcounty_demographic_clean.rename(
    columns={'subcounty': 'clean_adm4'}, inplace=True)
subcounty_demographic_clean.to_csv(data_filepath+'ta_2_subcounty_demographic_clean.csv')
Image(dictionary_filepath+"2A-Demographic-Dictionary.png")