2. Demographic data
Accessing demographic data from the Ugandan Bureau of Statistics
We have manually downloaded regional demographic data for Uganda on our computer and will load it in this notebook to do some initial cleaning.
We will clean and prepare regional demographic data for further analysis.
%run /Users/thomasadler/Desktop/futuristic-platipus/notebooks/ta_01_packages_functions.py
We have manually chosen the relevant columns from the Ugandan Statistics datasets for multiple reasons:
- every region (4) has its own dataset, with inconsistent naming of columns and sheets
- there are multiple sheets within each excel workbook, each with an inconsistent number of indexes
- manually choosing an initial set of relevant variables is easier when visualizing the full names in an excel workbook and comparing those with other columns
- the number of data points is manageable in excel and does not run the risk of the data being cut
- minimal data processing has been carried out to avoid any mistakes
regions_df = ['eastern', 'western', 'central', 'northern']
eastern_df = pd.DataFrame(pd.read_excel(
f'{data_filepath}{regions_df[0]}_demographic.xlsx'))
western_df = pd.DataFrame(pd.read_excel(
f'{data_filepath}{regions_df[1]}_demographic.xlsx'))
central_df = pd.DataFrame(pd.read_excel(
f'{data_filepath}{regions_df[2]}_demographic.xlsx'))
northern_df = pd.DataFrame(pd.read_excel(
f'{data_filepath}{regions_df[3]}_demographic.xlsx'))
regions_df = [eastern_df, western_df, central_df, northern_df]
demographic = pd.concat(regions_df, axis=0)
# check
demographic.head()
total_demographic = demographic.copy()
total_demographic.head()
total_demographic.info()
total_demographic.isna().sum()
We fill missing values of a parish by its subcounty average. Due to the low number of missing values, we assume that the average is a good representation of the subcounty's situation.
null_cols = total_demographic.isna().sum()
cols_to_impute = null_cols[null_cols > 0].index
# check
cols_to_impute
for col in cols_to_impute:
missing_to_regional_median(total_demographic, col, 'subcounty')
# check
total_demographic.isna().sum()
null_cols2 = total_demographic.isna().sum()
cols_to_impute2 = null_cols[null_cols > 0].index
# check
cols_to_impute2
For those that still have missing values, due to the fact that the whole subcounty does not have values, we fill the missing values with the country's average.
for col in cols_to_impute2:
total_demographic[col] = total_demographic[col].fillna(
total_demographic[col].mean())
# check no more null values
total_demographic.isna().sum().sum() == 0
total_demographic.duplicated().sum() == 0
total_demographic.T.duplicated().sum() == 0
We have information for each parish, however we need data for each subcounty. We will use population as a proxy to calculate the weight of each parish in a subcounty. We then calculate a weighted average of parishes to get subcounty values.
tot_pop_subcounty = total_demographic[[
'subcounty', 'tot_pop_parish']].groupby('subcounty').sum()
# rename column
tot_pop_subcounty.columns = ['tot_pop_subcounty']
# check
tot_pop_subcounty
total_demographic = total_demographic.merge(
tot_pop_subcounty, how='left', on='subcounty')
# check
total_demographic.head()
total_demographic['weight'] = total_demographic['tot_pop_parish'] / \
total_demographic['tot_pop_subcounty']
# check all weights equal to one
total_demographic[['subcounty', 'weight']].groupby('subcounty').sum(
).sum() == len(total_demographic[['subcounty']].groupby('subcounty'))
total_demographic.iloc[:, 2:-1] = total_demographic.iloc[:,
2:-1].multiply(total_demographic['weight'], axis='index')
# check
total_demographic.head()
subcounty_demographic = total_demographic.groupby('subcounty').sum()
# check that no column is more than 100 (except for population)
subcounty_demographic.describe()
subcounty_demographic.reset_index(inplace=True)
# check
subcounty_demographic.head()
subcounty_demographic_clean = pd.DataFrame(
subcounty_demographic.drop(columns=['weight', 'tot_pop_parish']))
# check current columns
subcounty_demographic_clean.info()
subcounty_demographic_clean.info()
print(subcounty_demographic_clean.isna().sum().sum() > 0,
subcounty_demographic_clean.duplicated().sum() > 0,
subcounty_demographic_clean.T.duplicated().sum() > 0)
subcounty_demographic_clean.rename(
columns={'subcounty': 'clean_adm4'}, inplace=True)
subcounty_demographic_clean.to_csv(data_filepath+'ta_2_subcounty_demographic_clean.csv')
Image(dictionary_filepath+"2A-Demographic-Dictionary.png")