We have manually downloaded regional demographic data for Uganda on our computer and will load it in this notebook to do some initial cleaning.

Accessing Demographic Data

We will clean and prepare regional demographic data for further analysis.

%run /Users/thomasadler/Desktop/futuristic-platipus/notebooks/ta_01_packages_functions.py

We have manually chosen the relevant columns from the Ugandan Statistics datasets for multiple reasons:

every region (4) has its own dataset, with inconsistent naming of columns and sheets
there are multiple sheets within each excel workbook, each with an inconsistent number of indexes
manually choosing an initial set of relevant variables is easier when visualizing the full names in an excel workbook and comparing those with other columns
the number of data points is manageable in excel and does not run the risk of the data being cut
minimal data processing has been carried out to avoid any mistakes

regions_df = ['eastern', 'western', 'central', 'northern']

eastern_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[0]}_demographic.xlsx'))
western_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[1]}_demographic.xlsx'))
central_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[2]}_demographic.xlsx'))
northern_df = pd.DataFrame(pd.read_excel(
    f'{data_filepath}{regions_df[3]}_demographic.xlsx'))

regions_df = [eastern_df, western_df, central_df, northern_df]

demographic = pd.concat(regions_df, axis=0)

# check
demographic.head()

total_demographic = demographic.copy()

total_demographic.head()

total_demographic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7557 entries, 0 to 1650
Data columns (total 29 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    7557 non-null   object 
 1   parish                       7557 non-null   object 
 2   tot_pop_parish               7557 non-null   int64  
 3   perc_hh_head_male            7557 non-null   float64
 4   perc_pop612_primary          7557 non-null   float64
 5   perc_pop1318_secondary       7557 non-null   float64
 6   perc_pop18p_illiterate       7557 non-null   float64
 7   perc_pop017_certificate      7557 non-null   float64
 8   perc_pop017_both_parents     7557 non-null   float64
 9   perc_pop2p_disability        7557 non-null   float64
 10  perc_pop1017_married         7557 non-null   float64
 11  perc_pop1217_birth           7557 non-null   float64
 12  perc_pop1464_working         7557 non-null   float64
 13  perc_pop10p_mobile_phone     7557 non-null   float64
 14  perc_hh_temp_dwelling        7502 non-null   float64
 15  perc_pop_5km_dist_primary    7556 non-null   float64
 16  perc_pop_5km_dist_secondary  7557 non-null   float64
 17  perc_pop_5km_dist_health     7557 non-null   float64
 18  perc_pop_5km_dist_police     7557 non-null   float64
 19  perc_hh_mosquito_net         7557 non-null   float64
 20  perc_hh_piped_water          7555 non-null   float64
 21  perc_hh_borehole             7552 non-null   float64
 22  perc_hh_toilet               7540 non-null   float64
 23  perc_hh_own_house            7557 non-null   float64
 24  perc_hh_own_tv               7557 non-null   float64
 25  perc_hh_bank_acc             7557 non-null   float64
 26  perc_hh_subs_farm            7557 non-null   float64
 27  perc_hh_less2meals           7557 non-null   float64
 28  perc_hh_electricity          7557 non-null   float64
dtypes: float64(26), int64(1), object(2)
memory usage: 1.7+ MB

total_demographic.isna().sum()

subcounty                       0
parish                          0
tot_pop_parish                  0
perc_hh_head_male               0
perc_pop612_primary             0
perc_pop1318_secondary          0
perc_pop18p_illiterate          0
perc_pop017_certificate         0
perc_pop017_both_parents        0
perc_pop2p_disability           0
perc_pop1017_married            0
perc_pop1217_birth              0
perc_pop1464_working            0
perc_pop10p_mobile_phone        0
perc_hh_temp_dwelling          55
perc_pop_5km_dist_primary       1
perc_pop_5km_dist_secondary     0
perc_pop_5km_dist_health        0
perc_pop_5km_dist_police        0
perc_hh_mosquito_net            0
perc_hh_piped_water             2
perc_hh_borehole                5
perc_hh_toilet                 17
perc_hh_own_house               0
perc_hh_own_tv                  0
perc_hh_bank_acc                0
perc_hh_subs_farm               0
perc_hh_less2meals              0
perc_hh_electricity             0
dtype: int64

We fill missing values of a parish by its subcounty average. Due to the low number of missing values, we assume that the average is a good representation of the subcounty's situation.

null_cols = total_demographic.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index

# check
cols_to_impute

Index(['perc_hh_temp_dwelling', 'perc_pop_5km_dist_primary',
       'perc_hh_piped_water', 'perc_hh_borehole', 'perc_hh_toilet'],
      dtype='object')

for col in cols_to_impute:
    missing_to_regional_median(total_demographic, col, 'subcounty')

# check
total_demographic.isna().sum()

subcounty                      0
parish                         0
tot_pop_parish                 0
perc_hh_head_male              0
perc_pop612_primary            0
perc_pop1318_secondary         0
perc_pop18p_illiterate         0
perc_pop017_certificate        0
perc_pop017_both_parents       0
perc_pop2p_disability          0
perc_pop1017_married           0
perc_pop1217_birth             0
perc_pop1464_working           0
perc_pop10p_mobile_phone       0
perc_hh_temp_dwelling          7
perc_pop_5km_dist_primary      0
perc_pop_5km_dist_secondary    0
perc_pop_5km_dist_health       0
perc_pop_5km_dist_police       0
perc_hh_mosquito_net           0
perc_hh_piped_water            0
perc_hh_borehole               0
perc_hh_toilet                 2
perc_hh_own_house              0
perc_hh_own_tv                 0
perc_hh_bank_acc               0
perc_hh_subs_farm              0
perc_hh_less2meals             0
perc_hh_electricity            0
dtype: int64

null_cols2 = total_demographic.isna().sum()
cols_to_impute2 = null_cols[null_cols > 0].index

# check
cols_to_impute2

Index(['perc_hh_temp_dwelling', 'perc_pop_5km_dist_primary',
       'perc_hh_piped_water', 'perc_hh_borehole', 'perc_hh_toilet'],
      dtype='object')

For those that still have missing values, due to the fact that the whole subcounty does not have values, we fill the missing values with the country's average.

for col in cols_to_impute2:
    total_demographic[col] = total_demographic[col].fillna(
        total_demographic[col].mean())

# check no more null values
total_demographic.isna().sum().sum() == 0

True

total_demographic.duplicated().sum() == 0

True

total_demographic.T.duplicated().sum() == 0

True

We have information for each parish, however we need data for each subcounty. We will use population as a proxy to calculate the weight of each parish in a subcounty. We then calculate a weighted average of parishes to get subcounty values.

tot_pop_subcounty = total_demographic[[
    'subcounty', 'tot_pop_parish']].groupby('subcounty').sum()

# rename column
tot_pop_subcounty.columns = ['tot_pop_subcounty']

# check
tot_pop_subcounty

total_demographic = total_demographic.merge(
    tot_pop_subcounty, how='left', on='subcounty')

# check
total_demographic.head()

total_demographic['weight'] = total_demographic['tot_pop_parish'] / \
    total_demographic['tot_pop_subcounty']

# check all weights equal to one
total_demographic[['subcounty', 'weight']].groupby('subcounty').sum(
).sum() == len(total_demographic[['subcounty']].groupby('subcounty'))

weight    True
dtype: bool

total_demographic.iloc[:, 2:-1] = total_demographic.iloc[:,
                                                         2:-1].multiply(total_demographic['weight'], axis='index')

# check
total_demographic.head()

subcounty_demographic = total_demographic.groupby('subcounty').sum()

# check that no column is more than 100 (except for population)
subcounty_demographic.describe()

subcounty_demographic.reset_index(inplace=True)

# check
subcounty_demographic.head()

subcounty_demographic_clean = pd.DataFrame(
    subcounty_demographic.drop(columns=['weight', 'tot_pop_parish']))

# check current columns
subcounty_demographic_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1382 entries, 0 to 1381
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    1382 non-null   object 
 1   perc_hh_head_male            1382 non-null   float64
 2   perc_pop612_primary          1382 non-null   float64
 3   perc_pop1318_secondary       1382 non-null   float64
 4   perc_pop18p_illiterate       1382 non-null   float64
 5   perc_pop017_certificate      1382 non-null   float64
 6   perc_pop017_both_parents     1382 non-null   float64
 7   perc_pop2p_disability        1382 non-null   float64
 8   perc_pop1017_married         1382 non-null   float64
 9   perc_pop1217_birth           1382 non-null   float64
 10  perc_pop1464_working         1382 non-null   float64
 11  perc_pop10p_mobile_phone     1382 non-null   float64
 12  perc_hh_temp_dwelling        1382 non-null   float64
 13  perc_pop_5km_dist_primary    1382 non-null   float64
 14  perc_pop_5km_dist_secondary  1382 non-null   float64
 15  perc_pop_5km_dist_health     1382 non-null   float64
 16  perc_pop_5km_dist_police     1382 non-null   float64
 17  perc_hh_mosquito_net         1382 non-null   float64
 18  perc_hh_piped_water          1382 non-null   float64
 19  perc_hh_borehole             1382 non-null   float64
 20  perc_hh_toilet               1382 non-null   float64
 21  perc_hh_own_house            1382 non-null   float64
 22  perc_hh_own_tv               1382 non-null   float64
 23  perc_hh_bank_acc             1382 non-null   float64
 24  perc_hh_subs_farm            1382 non-null   float64
 25  perc_hh_less2meals           1382 non-null   float64
 26  perc_hh_electricity          1382 non-null   float64
 27  tot_pop_subcounty            1382 non-null   float64
dtypes: float64(27), object(1)
memory usage: 302.4+ KB

subcounty_demographic_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1382 entries, 0 to 1381
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   subcounty                    1382 non-null   object 
 1   perc_hh_head_male            1382 non-null   float64
 2   perc_pop612_primary          1382 non-null   float64
 3   perc_pop1318_secondary       1382 non-null   float64
 4   perc_pop18p_illiterate       1382 non-null   float64
 5   perc_pop017_certificate      1382 non-null   float64
 6   perc_pop017_both_parents     1382 non-null   float64
 7   perc_pop2p_disability        1382 non-null   float64
 8   perc_pop1017_married         1382 non-null   float64
 9   perc_pop1217_birth           1382 non-null   float64
 10  perc_pop1464_working         1382 non-null   float64
 11  perc_pop10p_mobile_phone     1382 non-null   float64
 12  perc_hh_temp_dwelling        1382 non-null   float64
 13  perc_pop_5km_dist_primary    1382 non-null   float64
 14  perc_pop_5km_dist_secondary  1382 non-null   float64
 15  perc_pop_5km_dist_health     1382 non-null   float64
 16  perc_pop_5km_dist_police     1382 non-null   float64
 17  perc_hh_mosquito_net         1382 non-null   float64
 18  perc_hh_piped_water          1382 non-null   float64
 19  perc_hh_borehole             1382 non-null   float64
 20  perc_hh_toilet               1382 non-null   float64
 21  perc_hh_own_house            1382 non-null   float64
 22  perc_hh_own_tv               1382 non-null   float64
 23  perc_hh_bank_acc             1382 non-null   float64
 24  perc_hh_subs_farm            1382 non-null   float64
 25  perc_hh_less2meals           1382 non-null   float64
 26  perc_hh_electricity          1382 non-null   float64
 27  tot_pop_subcounty            1382 non-null   float64
dtypes: float64(27), object(1)
memory usage: 302.4+ KB

print(subcounty_demographic_clean.isna().sum().sum() > 0,
      subcounty_demographic_clean.duplicated().sum() > 0,
      subcounty_demographic_clean.T.duplicated().sum() > 0)

False False False

subcounty_demographic_clean.rename(
    columns={'subcounty': 'clean_adm4'}, inplace=True)

subcounty_demographic_clean.to_csv(data_filepath+'ta_2_subcounty_demographic_clean.csv')

Image(dictionary_filepath+"2A-Demographic-Dictionary.png")

	subcounty	parish	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	...	perc_hh_mosquito_net	perc_hh_piped_water	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity
0	Eastern Division	Naluwerere	5878	74.8	91.5	49.9	29.0	31.4	92.4	8.9	...	83.5	20.1	59.4	5.0	55.8	12.8	23.1	38.4	6.3	19.5
1	Eastern Division	Nkusi	7946	74.3	88.7	58.2	14.9	41.9	93.6	7.8	...	86.9	34.3	10.6	1.1	25.9	33.3	44.3	12.1	8.3	44.9
2	Western Division	Bwole	8370	74.5	90.9	56.8	14.1	41.9	92.8	9.7	...	89.7	29.7	26.9	1.2	38.4	27.8	43.8	23.7	8.4	37.4
3	Western Division	Ndifakulya	6553	73.4	87.6	55.0	16.2	46.6	93.7	10.3	...	87.5	22.8	22.6	1.8	37.1	27.1	37.4	28.6	6.7	35.1
4	Budhaya	Budhaya	5796	84.3	88.8	23.4	45.2	27.6	95.1	9.1	...	80.1	0.2	96.8	13.1	88.9	1.1	9.2	81.7	5.9	6.1

	subcounty	parish	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	...	perc_hh_mosquito_net	perc_hh_piped_water	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity
0	Eastern Division	Naluwerere	5878	74.8	91.5	49.9	29.0	31.4	92.4	8.9	...	83.5	20.1	59.4	5.0	55.8	12.8	23.1	38.4	6.3	19.5
1	Eastern Division	Nkusi	7946	74.3	88.7	58.2	14.9	41.9	93.6	7.8	...	86.9	34.3	10.6	1.1	25.9	33.3	44.3	12.1	8.3	44.9
2	Western Division	Bwole	8370	74.5	90.9	56.8	14.1	41.9	92.8	9.7	...	89.7	29.7	26.9	1.2	38.4	27.8	43.8	23.7	8.4	37.4
3	Western Division	Ndifakulya	6553	73.4	87.6	55.0	16.2	46.6	93.7	10.3	...	87.5	22.8	22.6	1.8	37.1	27.1	37.4	28.6	6.7	35.1
4	Budhaya	Budhaya	5796	84.3	88.8	23.4	45.2	27.6	95.1	9.1	...	80.1	0.2	96.8	13.1	88.9	1.1	9.2	81.7	5.9	6.1

	tot_pop_subcounty
subcounty
Abako	24329
Abanga	17613
Abarilela	24290
Aber	33366
Abia	23436
...	...
Yumbe Town Council	34806
Zesui	10782
Zeu	35354
Zirobwe	44992
Zombo Town Council	12091

	subcounty	parish	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	...	perc_hh_piped_water	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity	tot_pop_subcounty
0	Eastern Division	Naluwerere	5878	74.8	91.5	49.9	29.0	31.4	92.4	8.9	...	20.1	59.4	5.0	55.8	12.8	23.1	38.4	6.3	19.5	146794
1	Eastern Division	Nkusi	7946	74.3	88.7	58.2	14.9	41.9	93.6	7.8	...	34.3	10.6	1.1	25.9	33.3	44.3	12.1	8.3	44.9	146794
2	Western Division	Bwole	8370	74.5	90.9	56.8	14.1	41.9	92.8	9.7	...	29.7	26.9	1.2	38.4	27.8	43.8	23.7	8.4	37.4	169457
3	Western Division	Ndifakulya	6553	73.4	87.6	55.0	16.2	46.6	93.7	10.3	...	22.8	22.6	1.8	37.1	27.1	37.4	28.6	6.7	35.1	169457
4	Budhaya	Budhaya	5796	84.3	88.8	23.4	45.2	27.6	95.1	9.1	...	0.2	96.8	13.1	88.9	1.1	9.2	81.7	5.9	6.1	28472

	subcounty	parish	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	...	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity	tot_pop_subcounty	weight
0	Eastern Division	Naluwerere	235.369865	2.995180	3.663890	1.998121	1.161233	1.257335	3.699928	0.356378	...	2.378525	0.200213	2.234372	0.512544	0.924982	1.537632	0.252268	0.780829	5878.0	0.040043
1	Eastern Division	Nkusi	430.119187	4.021880	4.801356	3.150382	0.806541	2.268059	5.066594	0.422216	...	0.573781	0.059543	1.401974	1.802538	2.397971	0.654976	0.449281	2.430449	7946.0	0.054130
2	Western Division	Bwole	413.419924	3.679783	4.489829	2.805526	0.696442	2.069569	4.583676	0.479113	...	1.328673	0.059272	1.896694	1.373127	2.163416	1.170616	0.414902	1.847300	8370.0	0.049393
3	Western Division	Ndifakulya	253.408292	2.838420	3.387543	2.126882	0.626463	1.802049	3.623433	0.398307	...	0.873955	0.069607	1.434678	1.047973	1.446280	1.105979	0.259093	1.357337	6553.0	0.038671
4	Budhaya	Budhaya	1179.882551	17.160818	18.076876	4.763501	9.201292	5.618488	19.359357	1.852473	...	19.705423	2.666746	18.097232	0.223925	1.872829	16.631540	1.201054	1.241767	5796.0	0.203568

	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	perc_pop1017_married	perc_pop1217_birth	...	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity	tot_pop_subcounty	weight
count	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	...	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1382.000000	1.382000e+03
mean	5325.183537	77.089347	80.701635	29.792665	33.362797	29.141930	91.772449	13.642423	6.564745	8.538391	...	37.962323	11.096495	81.550674	6.662546	17.433289	79.303604	11.223005	12.089956	25061.251809	1.000000e+00
std	4841.601497	5.125613	12.615068	9.577757	13.373853	14.973238	2.422723	5.800347	2.818645	3.914319	...	32.345151	16.196412	17.098956	10.063932	11.286255	20.531131	9.146918	12.956311	28004.997066	8.450033e-18
min	565.125115	43.296939	6.469903	3.019500	5.159501	1.896132	80.607712	2.400000	1.100000	2.060516	...	0.000000	0.000000	12.806064	0.197382	1.831688	1.191364	1.683737	0.483304	1981.000000	1.000000e+00
25%	2922.142298	73.772953	78.781441	23.387518	25.688698	17.218885	90.523072	10.025470	4.636206	5.890590	...	6.476420	2.043361	74.093670	1.523332	9.782145	76.522410	5.850080	4.837981	12767.750000	1.000000e+00
50%	4349.144386	77.310364	83.761165	28.671131	32.654889	27.438227	91.900018	12.706115	6.001666	7.714616	...	30.362959	5.215642	87.873951	3.117658	13.599117	87.469722	9.005092	7.857371	19268.000000	1.000000e+00
75%	6264.462068	80.759590	86.656814	35.128398	39.105892	38.857003	93.368579	16.113406	7.951506	10.177006	...	66.655302	12.242853	94.402075	7.030110	22.011153	92.201184	12.783530	13.648965	30635.000000	1.000000e+00
max	81962.015179	89.136705	94.120353	59.603670	95.163615	89.807095	96.771055	70.398504	24.485364	53.852961	...	99.463761	96.092403	99.200501	69.874708	72.581032	98.034043	70.340232	85.396092	425070.000000	1.000000e+00

	subcounty	tot_pop_parish	perc_hh_head_male	perc_pop612_primary	perc_pop1318_secondary	perc_pop18p_illiterate	perc_pop017_certificate	perc_pop017_both_parents	perc_pop2p_disability	perc_pop1017_married	...	perc_hh_borehole	perc_hh_toilet	perc_hh_own_house	perc_hh_own_tv	perc_hh_bank_acc	perc_hh_subs_farm	perc_hh_less2meals	perc_hh_electricity	tot_pop_subcounty	weight
0	Abako	4980.425131	77.188397	73.638242	23.234560	37.155288	40.631497	89.292223	19.016807	5.219919	...	27.780579	11.052814	97.636623	0.733992	9.180801	93.104768	11.556529	3.969103	24329.0	1.0
1	Abanga	3657.634645	74.663209	49.826100	8.726509	56.078805	6.268466	91.904968	16.344297	8.435729	...	4.249634	17.929563	94.744581	0.576716	9.267484	92.562215	4.859519	3.101715	17613.0	1.0
2	Abarilela	4316.878551	71.659304	82.324026	23.733903	43.352421	32.763751	92.361029	16.989201	4.838436	...	96.878987	32.313685	95.087715	1.206439	11.167011	95.398728	14.410782	13.117472	24290.0	1.0
3	Aber	8844.329617	75.875232	76.815372	22.464650	37.663891	39.250420	91.166310	12.758754	10.031098	...	72.613025	9.453135	95.480702	1.503926	8.704004	93.712237	6.937463	6.639306	33366.0	1.0
4	Abia	3923.494880	79.001784	74.058850	22.057540	35.890114	28.721663	89.266701	16.557975	9.301314	...	42.406439	14.932744	98.024441	0.811487	9.335049	94.918928	21.129971	4.331349	23436.0	1.0