Analysis of Child Marriage Data

Yunjeong Chang

Table of Contents:

  1. Introduction
  2. Data Collection and Processing
    • Set Standard Year Based on Data Frequency
    • Extract 40 Top & 40 Bottom Average Marriage Age Countries
    • GDP Data
    • Female Education Data
    • Intimate Partner Violence Rate Data
  3. Data Exploration and Analysis
    • Relationship between GDP and Marriage Age
    • Relationship between Female Education and Marriage Age
    • Relationship between Intimate Partner Violence Rate and Marriage Age
  4. Hypothesis Testing
    • Null Hypothesis Testing for 40 Bottom countries
    • Predicted Marriage Age for 40 Top countries
    • Null Hypothesis Testing for 40 Top countries
  5. Insights

1. Introduction

According to UNICEF, child marriage is any formal or informal union between a child under the age of 18 and an adult or another child (https://www.unicef.org/protection/child-marriage), and it mostly affects girls. Child marriage is a global problem alongside poverty, human trafficking, and environmental pollution, yet compared to those issues many people are unaware of it and of why it matters. One of its root causes is poverty: families in financial difficulty often force their young daughters to marry in order to receive a dowry from the partner. Child marriage under such parental pressure often leads to sexual and emotional abuse of the child. In addition, children who marry early are deprived of education and work opportunities before they are fully grown. Because adolescents are not yet physically and mentally mature, those who become pregnant during this phase may lose their lives or suffer lasting bodily or psychological harm from childbirth complications. If you want to know more about the problems of child marriage, visit the following sites.

  • https://www.savethechildren.org/us/charity-stories/child-marriage-a-violation-of-child-rights
  • https://interactive.unwomen.org/multimedia/infographic/violenceagainstwomen/en/index.html#childmarriage
  • https://www.unfpa.org/sites/default/files/pub-pdf/MarryingTooYoung.pdf

Even at this very moment, many girls are suffering as victims of early marriage. It is not a problem that can be solved immediately, but raising awareness matters, so that readers of this tutorial like you can help by sharing this topic with others. In this tutorial, we will look into which countries have high child marriage rates and what characteristics they share. Based on the processed data, we will discuss what solutions could be proposed to decrease the child marriage rate. Specifically, we will use each country's average female marriage age to find out which countries tend to marry at an early age. In addition, we will analyze how factors such as a country's Gross Domestic Product (GDP), education level, and Intimate Partner Violence (IPV) rate relate to the average female marriage age.

2. Data Collection and Processing

Set Standard Year Based on Data Frequency

The first dataset is the average age at marriage for women in each country over roughly 50 years, from 1970 to 2018 (https://ourworldindata.org/grapher/age-at-marriage-women). The following code keeps only the years after 2002 because we want to focus on recent data. However, the number of countries reporting differs from year to year: for instance, 2018 has data from only 8 countries whereas 2011 has data from 83. We therefore need to pick a single standard year within this period. To do so, we count the number of countries with data in each year and select the year with the highest count.

In [958]:
import math
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
import warnings
# ignore warnings
warnings.filterwarnings('ignore')

# display all rows for tables
pd.set_option("display.max_rows", None, "display.max_columns", None)

child_marriage = pd.read_csv("marriageage.csv")
# filter recent data after 2002
child_marriage_post_2002 = child_marriage[child_marriage.Year > 2002] 
count_by_year = child_marriage_post_2002['Year'].value_counts()
# organize the per-year counts into a dataframe with proper column names
df_with_year_count_columns = pd.DataFrame({'Year':count_by_year.index, 'Count':count_by_year.values})
# sort chronologically (sort_index would keep the count-descending order from value_counts)
child_marriage_2002 = df_with_year_count_columns.sort_values(by='Year')

# Create a plot with most data since 2002
plt.figure(figsize = (10, 10))
plt.bar(child_marriage_2002['Year'], child_marriage_2002['Count'])

# label the count values for each year
xlabels = []
for index, row in child_marriage_2002.iterrows():
    plt.text(row.Year, row.Count, row.Count, ha = 'center', bbox = dict(facecolor = 'red', alpha =0.8))
    xlabels.append(row.Year)
# reuse the axes that already holds the bar chart
ax = plt.gca()
ax.set_xticks(child_marriage_2002['Year'])
ax.set_xticklabels(xlabels)
    
plt.title("Number of data for each year")
plt.xlabel("Year")
plt.ylabel("Count")

child_marriage_2010 = child_marriage[child_marriage.Year == 2010]
child_marriage_2010 = child_marriage_2010.rename(columns={'Estimated average age at marriage, women': 'MarriageAge'})
child_marriage_2010 = child_marriage_2010.sort_values(by=['MarriageAge'])
child_marriage_2010 = child_marriage_2010.drop(columns=['Year'])

Since 2010 has the highest count, with 85 countries, we will use 2010 as the standard year. That is, the subsequent processing of the GDP, education, and IPV data should use the same year.
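
The standard year can also be selected programmatically rather than by reading the bar chart. The following is a minimal sketch on a small made-up table; the `toy` frame and its values are illustrative assumptions, not the real dataset. It counts rows per year and picks the year with the most entries.

```python
import pandas as pd

# Hypothetical toy data standing in for marriageage.csv (values are made up)
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "A", "B", "A"],
    "Year":   [2010, 2010, 2010, 2011, 2011, 2018],
})

# Keep only recent years, mirroring the filter used in this tutorial
recent = toy[toy["Year"] > 2002]

# Number of countries reporting in each year, ordered by year
count_by_year = recent["Year"].value_counts().sort_index()

# The standard year is the one with the largest count
standard_year = count_by_year.idxmax()
print(standard_year)
```

If two years tie for the highest count, `idxmax` returns the first one in the sorted index, so a tie would need to be resolved by hand.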

Extract 40 Top & 40 Bottom Average Marriage Age Countries

Instead of treating all 85 countries as a single group, we separate them into two groups. This is based on the assumption that factors such as GDP, education, and IPV may affect the average marriage age differently in each group. Similar research has examined the relationship between GDP and the Happiness Index and suggests that, beyond a certain GDP level, GDP has little further impact on happiness. We would therefore like to see how these factors relate to marriage age within each group.

Since there are 85 countries, we take the 40 with the highest and the 40 with the lowest average marriage age; the 5 countries in the middle of the ranking belong to neither group. "Top 40 Countries" refers to the countries with higher marriage age, whereas "Bottom 40 Countries" refers to the countries with relatively lower marriage age.
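
The split described here can be sketched on toy data as follows. The frame `ages`, its values, and the group size `n` are assumptions made for illustration; the tutorial itself uses 40 countries per group.

```python
import pandas as pd

# Hypothetical toy data: 7 countries with made-up average marriage ages
ages = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "E", "F", "G"],
    "MarriageAge": [30.1, 18.2, 25.0, 22.4, 27.9, 20.3, 31.5],
}).sort_values(by="MarriageAge")

n = 3  # group size for this toy example

# .copy() makes each group an independent frame, so columns added later
# do not trigger pandas' SettingWithCopyWarning
bottom = ages.head(n).copy()   # lowest marriage ages
top = ages.tail(n).copy()      # highest marriage ages

# Rows in the middle of the ranking that fall in neither group
middle = ages.iloc[n:len(ages) - n]
print(list(middle["Entity"]))
```

Taking explicit copies is a design choice rather than a requirement: the later cells add GDP, Education, and IPV columns to each group, and independent frames make those assignments unambiguous.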

In [959]:
top_40_countries = child_marriage_2010.tail(40)
top_40_countries
Out[959]:
Entity Code MarriageAge
350 Croatia HRV 27.100000
1589 Slovakia SVK 27.200001
1424 Portugal PRT 27.700001
93 Bahamas BHS 27.900000
1577 Singapore SGP 27.900000
399 Czechia CZE 27.900000
51 Australia AUS 27.900000
264 Cayman Islands CYM 28.000000
499 Estonia EST 28.000000
1303 Northern Mariana Islands MNP 28.100000
1268 New Zealand NZL 28.200001
750 Hungary HUN 28.299999
30 Aruba ABW 28.500000
141 Belize BLZ 28.600000
1133 Malta MLT 28.600000
892 Japan JPN 28.799999
949 Korea KOR 28.900000
1616 Slovenia SVN 29.000000
1434 Puerto Rico PRI 29.200001
648 Greece GRC 29.299999
135 Belgium BEL 29.400000
1021 Liechtenstein LIE 29.600000
1234 Netherlands NLD 29.799999
1724 Switzerland CHE 29.799999
291 Chile CHL 29.900000
1842 United Kingdom GBR 30.000000
548 Finland FIN 30.200001
1075 Luxembourg LUX 30.200001
863 Italy ITA 30.299999
151 Bermuda BMU 30.600000
568 France FRA 30.700001
1321 Norway NOR 30.799999
1660 Spain ESP 30.900000
1877 United States Virgin Islands VIR 30.900000
678 Greenland GRL 31.200001
432 Denmark DNK 31.200001
819 Ireland IRL 31.299999
110 Barbados BRB 31.900000
777 Iceland ISL 32.400002
1698 Sweden SWE 32.700001
In [960]:
bottom_40_countries = child_marriage_2010.head(40)
bottom_40_countries
Out[960]:
Entity Code MarriageAge
268 Central African Republic CAF 17.299999
272 Chad TCD 18.200001
216 Burkina Faso BFA 19.500000
1100 Malawi MWI 19.700001
410 Democratic Republic of Congo COD 20.200001
1931 Zimbabwe ZWE 20.600000
1570 Sierra Leone SLE 21.000000
514 Ethiopia ETH 21.200001
1 Afghanistan AFG 21.500000
1349 Panama PAN 21.600000
1552 Senegal SEN 21.600000
928 Kiribati KIR 21.600000
703 Guinea-Bissau GNB 21.700001
1675 Sudan SDN 22.000000
229 Cambodia KHM 22.000000
223 Burundi BDI 22.100000
1902 Vietnam VNM 22.299999
1764 Togo TGO 22.400000
1679 Suriname SUR 22.799999
1365 Peru PER 23.100000
911 Kazakhstan KAZ 23.100000
1343 Palestine PSE 23.200001
1175 Mongolia MNG 23.400000
1793 Turkey TUR 23.799999
175 Brazil BRA 23.900000
327 Costa Rica CRI 24.200001
1470 Russia RUS 24.400000
25 Armenia ARM 24.500000
22 Argentina ARG 24.600000
835 Israel ISR 24.799999
1757 Thailand THA 24.900000
1159 Micronesia (country) FSM 25.299999
1462 Romania ROU 25.600000
1331 Oman OMN 25.600000
1105 Malaysia MYS 25.700001
1559 Serbia SRB 26.000000
1868 United States USA 26.100000
1397 Poland POL 26.100000
508 Eswatini SWZ 26.200001
1152 Mexico MEX 26.200001
In [961]:
print("Average Marriage Age's mean and standard deviation for top 40 countries: {}, {}".format(top_40_countries.MarriageAge.mean(), top_40_countries.MarriageAge.std()))
print("Average Marriage Age's mean and standard deviation for bottom 40 countries: {}, {}".format(bottom_40_countries.MarriageAge.mean(), bottom_40_countries.MarriageAge.std()))
Average Marriage Age's mean and standard deviation for top 40 countries: 29.4725001, 1.4487816634005557
Average Marriage Age's mean and standard deviation for bottom 40 countries: 23.0000001, 2.290224467780595

The highest average marriage age is 32.7, in Sweden, and the lowest is 17.3, in the Central African Republic. The difference between the maximum and the minimum is 15.4 years.

GDP

GDP is widely regarded as one of the key indicators of a country's economic power. As discussed above, a major reason for child marriage is the family's financial situation. On the assumption that dowry-driven marriages are less common in countries with higher GDP, we consider GDP as a possible contributing factor to child marriage.

The data comes from the World Bank (https://data.worldbank.org/indicator/NY.GDP.MKTP.CD). First, we extract the 2010 GDP value of each top 40 and bottom 40 country using its country code. Then we add these values to the existing dataframes as a new GDP column, expressed in billions of dollars.

In [962]:
gdp = pd.read_csv("gdp.csv", on_bad_lines='skip')
gdp_2010 = gdp[['Country Code', '2010']]

for index, row in top_40_countries.iterrows():
    # filter 2010 gdp data for the matching country code
    gdp_data = gdp_2010.loc[gdp_2010['Country Code'] == row.Code, '2010']
    # divided by billion for better readability
    top_40_countries.at[index, 'GDP'] = gdp_data.values[0] / 1000000000

for index, row in bottom_40_countries.iterrows():
    gdp_data = gdp_2010.loc[gdp_2010['Country Code'] == row.Code, '2010']
    bottom_40_countries.at[index, 'GDP'] = gdp_data.values[0] / 1000000000

top_40_countries
Out[962]:
Entity Code MarriageAge GDP
350 Croatia HRV 27.100000 60.426019
1589 Slovakia SVK 27.200001 90.801178
1424 Portugal PRT 27.700001 238.113003
93 Bahamas BHS 27.900000 10.095760
1577 Singapore SGP 27.900000 239.809388
399 Czechia CZE 27.900000 209.069941
51 Australia AUS 27.900000 1147.589183
264 Cayman Islands CYM 28.000000 4.156991
499 Estonia EST 28.000000 19.523477
1303 Northern Mariana Islands MNP 28.100000 0.799000
1268 New Zealand NZL 28.200001 146.517541
750 Hungary HUN 28.299999 132.231134
30 Aruba ABW 28.500000 2.390503
141 Belize BLZ 28.600000 1.377177
1133 Malta MLT 28.600000 9.035932
892 Japan JPN 28.799999 5759.071769
949 Korea KOR 28.900000 1144.066965
1616 Slovenia SVN 29.000000 48.208240
1434 Puerto Rico PRI 29.200001 98.381300
648 Greece GRC 29.299999 297.124962
135 Belgium BEL 29.400000 481.420883
1021 Liechtenstein LIE 29.600000 5.082366
1234 Netherlands NLD 29.799999 847.380859
1724 Switzerland CHE 29.799999 603.434493
291 Chile CHL 29.900000 218.537551
1842 United Kingdom GBR 30.000000 2491.110093
548 Finland FIN 30.200001 249.424311
1075 Luxembourg LUX 30.200001 56.213986
863 Italy ITA 30.299999 2136.099955
151 Bermuda BMU 30.600000 6.634526
568 France FRA 30.700001 2645.187882
1321 Norway NOR 30.799999 428.757038
1660 Spain ESP 30.900000 1422.108200
1877 United States Virgin Islands VIR 30.900000 4.324000
678 Greenland GRL 31.200001 2.503156
432 Denmark DNK 31.200001 321.995279
819 Ireland IRL 31.299999 221.876011
110 Barbados BRB 31.900000 4.530000
777 Iceland ISL 32.400002 13.751162
1698 Sweden SWE 32.700001 495.812559
In [963]:
bottom_40_countries
Out[963]:
Entity Code MarriageAge GDP
268 Central African Republic CAF 17.299999 2.142591
272 Chad TCD 18.200001 10.668103
216 Burkina Faso BFA 19.500000 10.109619
1100 Malawi MWI 19.700001 6.959656
410 Democratic Republic of Congo COD 20.200001 21.565720
1931 Zimbabwe ZWE 20.600000 12.041655
1570 Sierra Leone SLE 21.000000 2.578026
514 Ethiopia ETH 21.200001 29.933790
1 Afghanistan AFG 21.500000 15.856679
1349 Panama PAN 21.600000 29.440288
1552 Senegal SEN 21.600000 16.121315
928 Kiribati KIR 21.600000 0.156120
703 Guinea-Bissau GNB 21.700001 0.849878
1675 Sudan SDN 22.000000 58.962978
229 Cambodia KHM 22.000000 11.242275
223 Burundi BDI 22.100000 2.032135
1902 Vietnam VNM 22.299999 115.931750
1764 Togo TGO 22.400000 3.429461
1679 Suriname SUR 22.799999 4.368398
1365 Peru PER 23.100000 147.528937
911 Kazakhstan KAZ 23.100000 148.047348
1343 Palestine PSE 23.200001 9.681500
1175 Mongolia MNG 23.400000 7.189482
1793 Turkey TUR 23.799999 776.992600
175 Brazil BRA 23.900000 2208.838109
327 Costa Rica CRI 24.200001 37.658615
1470 Russia RUS 24.400000 1524.917468
25 Armenia ARM 24.500000 9.260285
22 Argentina ARG 24.600000 423.627422
835 Israel ISR 24.799999 234.654590
1757 Thailand THA 24.900000 341.104820
1159 Micronesia (country) FSM 25.299999 0.296944
1462 Romania ROU 25.600000 166.309355
1331 Oman OMN 25.600000 64.993498
1105 Malaysia MYS 25.700001 255.016609
1559 Serbia SRB 26.000000 41.819469
1868 United States USA 26.100000 14992.052727
1397 Poland POL 26.100000 479.834179
508 Eswatini SWZ 26.200001 4.438778
1152 Mexico MEX 26.200001 1057.801296
In [964]:
print("GDP's mean and standard deviation for top 40 countries: {}, {}".format(top_40_countries.GDP.mean(), top_40_countries.GDP.std()))
print("GDP's mean and standard deviation for bottom 40 countries: {}, {}".format(bottom_40_countries.GDP.mean(), bottom_40_countries.GDP.std()))
GDP's mean and standard deviation for top 40 countries: 557.8743444109135, 1079.5461180363852
GDP's mean and standard deviation for bottom 40 countries: 582.1613617224921, 2379.8165378475837
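
The row-by-row GDP lookup in this section can also be written as a single merge. The sketch below is an alternative, not the method used above; the two small frames and their GDP values are made up for illustration. A left merge keeps every country and yields NaN for a code that is absent from the GDP table, a case in which indexing `.values[0]` on an empty match would raise an error.

```python
import pandas as pd

# Hypothetical stand-ins for the country table and the World Bank GDP file
countries = pd.DataFrame({
    "Entity": ["Croatia", "Sweden", "Chad"],
    "Code": ["HRV", "SWE", "TCD"],
})
gdp = pd.DataFrame({
    "Country Code": ["SWE", "HRV", "USA"],
    "2010": [4.0e11, 6.0e10, 1.5e13],   # made-up values in dollars
})

# Left merge keeps every country; codes missing from the GDP table become NaN
merged = countries.merge(
    gdp[["Country Code", "2010"]],
    left_on="Code", right_on="Country Code", how="left",
)

# Express GDP in billions of dollars and drop the helper columns
merged["GDP"] = merged["2010"] / 1e9
merged = merged.drop(columns=["Country Code", "2010"])
print(merged)
```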

Female Education Level

Another probable indicator is the level of female education. In most countries, children who complete compulsory education before marriage are much less likely to marry at a young age. For example, in the United States, most states require children from age 6 to 17 to attend school.

The following dataset is from UNESCO (http://data.uis.unesco.org). Each value is the gross enrollment percentage of women in secondary education. Note that this percentage can exceed 100% because it counts over-aged and under-aged students, who result from early or late entry and grade repetition.

Since some of the data for 2010 are missing, we instead use the average of the available values from 2008 to 2012.
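
The averaging rule described above can be sketched on toy data. The country codes and enrollment values below are made up; only the year columns mirror the real file. `DataFrame.mean` with `axis=1` skips missing years by default, so each country is averaged over whatever years it has.

```python
import numpy as np
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]

# Hypothetical enrollment table (values are made up)
education = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "2008": [90.0, np.nan, np.nan],
    "2009": [92.0, np.nan, np.nan],
    "2010": [np.nan, 50.0, np.nan],
    "2011": [94.0, np.nan, np.nan],
    "2012": [np.nan, 54.0, np.nan],
})

# Row-wise mean over the available years; NaN only if every year is missing
education["Education"] = education[years].mean(axis=1, skipna=True)
print(education[["Country Code", "Education"]])
```

Here `AAA` is averaged over three years, `BBB` over two, and `CCC` stays missing because it has no data in any of the five years.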

In [965]:
years = ['2008', '2009', '2010', '2011', '2012']
columns = np.append(['Country Code'], years)
education = pd.read_csv("female_education.csv", on_bad_lines='skip')
education_2008_2012 = education[columns]

def mean_education(code):
    # rows of the education table for this country (normally exactly one)
    rows = education_2008_2012.loc[education_2008_2012['Country Code'] == code, years]
    if rows.empty:
        return np.nan
    # average only the years that have data; NaN if all five years are missing
    return rows.iloc[0].mean()

for index, row in top_40_countries.iterrows():
    top_40_countries.loc[index, 'Education'] = mean_education(row.Code)

for index, row in bottom_40_countries.iterrows():
    bottom_40_countries.loc[index, 'Education'] = mean_education(row.Code)

top_40_countries
Out[965]:
Entity Code MarriageAge GDP Education
350 Croatia HRV 27.100000 60.426019 101.479875
1589 Slovakia SVK 27.200001 90.801178 92.627159
1424 Portugal PRT 27.700001 238.113003 107.878590
93 Bahamas BHS 27.900000 10.095760 89.653389
1577 Singapore SGP 27.900000 239.809388 NaN
399 Czechia CZE 27.900000 209.069941 95.394522
51 Australia AUS 27.900000 1147.589183 NaN
264 Cayman Islands CYM 28.000000 4.156991 81.848961
499 Estonia EST 28.000000 19.523477 105.432600
1303 Northern Mariana Islands MNP 28.100000 0.799000 NaN
1268 New Zealand NZL 28.200001 146.517541 122.132478
750 Hungary HUN 28.299999 132.231134 96.114362
30 Aruba ABW 28.500000 2.390503 101.445366
141 Belize BLZ 28.600000 1.377177 78.050067
1133 Malta MLT 28.600000 9.035932 96.253578
892 Japan JPN 28.799999 5759.071769 NaN
949 Korea KOR 28.900000 1144.066965 96.175832
1616 Slovenia SVN 29.000000 48.208240 98.524147
1434 Puerto Rico PRI 29.200001 98.381300 86.932426
648 Greece GRC 29.299999 297.124962 103.172908
135 Belgium BEL 29.400000 481.420883 168.041019
1021 Liechtenstein LIE 29.600000 5.082366 100.130093
1234 Netherlands NLD 29.799999 847.380859 123.054074
1724 Switzerland CHE 29.799999 603.434493 94.394710
291 Chile CHL 29.900000 218.537551 91.442189
1842 United Kingdom GBR 30.000000 2491.110093 97.737492
548 Finland FIN 30.200001 249.424311 110.547345
1075 Luxembourg LUX 30.200001 56.213986 101.570770
863 Italy ITA 30.299999 2136.099955 101.594011
151 Bermuda BMU 30.600000 6.634526 82.988594
568 France FRA 30.700001 2645.187882 106.648727
1321 Norway NOR 30.799999 428.757038 110.841403
1660 Spain ESP 30.900000 1422.108200 120.583470
1877 United States Virgin Islands VIR 30.900000 4.324000 NaN
678 Greenland GRL 31.200001 2.503156 NaN
432 Denmark DNK 31.200001 321.995279 121.352472
819 Ireland IRL 31.299999 221.876011 118.526894
110 Barbados BRB 31.900000 4.530000 104.351807
777 Iceland ISL 32.400002 13.751162 108.613681
1698 Sweden SWE 32.700001 495.812559 98.319490
In [966]:
bottom_40_countries
Out[966]:
Entity Code MarriageAge GDP Education
268 Central African Republic CAF 17.299999 2.142591 11.464193
272 Chad TCD 18.200001 10.668103 13.395092
216 Burkina Faso BFA 19.500000 10.109619 19.264774
1100 Malawi MWI 19.700001 6.959656 31.446800
410 Democratic Republic of Congo COD 20.200001 21.565720 30.457006
1931 Zimbabwe ZWE 20.600000 12.041655 50.154339
1570 Sierra Leone SLE 21.000000 2.578026 37.286970
514 Ethiopia ETH 21.200001 29.933790 31.734772
1 Afghanistan AFG 21.500000 15.856679 31.832814
1349 Panama PAN 21.600000 29.440288 72.731840
1552 Senegal SEN 21.600000 16.121315 36.028431
928 Kiribati KIR 21.600000 0.156120 91.443108
703 Guinea-Bissau GNB 21.700001 0.849878 NaN
1675 Sudan SDN 22.000000 58.962978 38.035416
229 Cambodia KHM 22.000000 11.242275 41.599239
223 Burundi BDI 22.100000 2.032135 20.830005
1902 Vietnam VNM 22.299999 115.931750 NaN
1764 Togo TGO 22.400000 3.429461 NaN
1679 Suriname SUR 22.799999 4.368398 80.066982
1365 Peru PER 23.100000 147.528937 92.318985
911 Kazakhstan KAZ 23.100000 148.047348 100.571630
1343 Palestine PSE 23.200001 9.681500 89.516943
1175 Mongolia MNG 23.400000 7.189482 97.961914
1793 Turkey TUR 23.799999 776.992600 81.419664
175 Brazil BRA 23.900000 2208.838109 100.510277
327 Costa Rica CRI 24.200001 37.658615 102.470738
1470 Russia RUS 24.400000 1524.917468 87.992987
25 Armenia ARM 24.500000 9.260285 97.644978
22 Argentina ARG 24.600000 423.627422 103.883884
835 Israel ISR 24.799999 234.654590 104.193341
1757 Thailand THA 24.900000 341.104820 85.234282
1159 Micronesia (country) FSM 25.299999 0.296944 NaN
1462 Romania ROU 25.600000 166.309355 95.950938
1331 Oman OMN 25.600000 64.993498 96.810646
1105 Malaysia MYS 25.700001 255.016609 80.759233
1559 Serbia SRB 26.000000 41.819469 93.420276
1868 United States USA 26.100000 14992.052727 NaN
1397 Poland POL 26.100000 479.834179 95.984723
508 Eswatini SWZ 26.200001 4.438778 64.199763
1152 Mexico MEX 26.200001 1057.801296 90.859625
In [967]:
print("{} countries have missing female education value".format(len(top_40_countries) - len(top_40_countries.dropna())))
print("{} countries have missing female education value".format(len(bottom_40_countries) - len(bottom_40_countries.dropna())))
# drop countries with missing education data
top_40_countries = top_40_countries.dropna()
bottom_40_countries = bottom_40_countries.dropna()
6 countries have missing female education value
5 countries have missing female education value

As shown, 11 countries have no female education data; that is, none of them has a value for any year from 2008 to 2012.

Missing-value techniques such as mean, hot-deck, and cold-deck imputation do not suit our situation, mostly because countries' education levels cannot easily be compared or related to one another. For instance, using a global average for a missing country such as Vietnam would likely skew the result. Therefore, the best option is to drop such countries from the analysis and increase the sample size instead. Note that this tutorial originally analyzed 40 countries, but the sample size was increased to 80 to account for missing data.
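
To illustrate why mean imputation is avoided here, the sketch below contrasts it with dropping incomplete rows. The table is made up for illustration: country `BBB` has a low marriage age, yet mean imputation would give it the average of countries with very different enrollment levels.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing education value (numbers are made up)
df = pd.DataFrame({
    "Code": ["AAA", "BBB", "CCC", "DDD"],
    "MarriageAge": [18.0, 19.0, 30.0, 31.0],
    "Education": [15.0, np.nan, 105.0, 110.0],
})

# Mean imputation: BBB receives the average of very different countries
imputed = df.assign(Education=df["Education"].fillna(df["Education"].mean()))

# Dropping instead: keep only rows where the Education column is observed
dropped = df.dropna(subset=["Education"])

print(imputed.loc[1, "Education"])
print(list(dropped["Code"]))
```

Passing `subset=["Education"]` restricts the check to one column, whereas a bare `dropna()` removes a row if any column is missing.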

Intimate Partner Violence Rate

According to reports from global organizations such as the United Nations International Children's Emergency Fund (UNICEF) and the World Health Organization (WHO), young women who are forced to marry older husbands are likely to experience domestic violence. Based on these findings, we would like to examine the relationship between IPV and the average female marriage age.

The following data from WHO gives, by country, the percentage of women who have experienced intimate partner violence (https://srhr.org/vaw-data/data). Unfortunately, the page is dynamically generated and was not easy to crawl; if you would like to learn how to crawl such pages, here is a helpful link: (https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dynamic_websites.htm). I therefore collected the data manually, entering each country's value one at a time, and saved it as a CSV file.

In [968]:
ipv = pd.read_csv("ipv.csv", on_bad_lines='skip')

for index, row in top_40_countries.iterrows():
    ipv_value = ipv.loc[ipv['Country Code'] == row.Code, 'IPV']
    # only populate non missing data
    if len(ipv_value.values) != 0:
        top_40_countries.at[index, 'IPV'] = ipv_value.values[0]

for index, row in bottom_40_countries.iterrows():
    ipv_value = ipv.loc[ipv['Country Code'] == row.Code, 'IPV']
    if len(ipv_value.values) != 0:
        bottom_40_countries.at[index, 'IPV'] = ipv_value.values[0] 

top_40_countries
Out[968]:
Entity Code MarriageAge GDP Education IPV
350 Croatia HRV 27.100000 60.426019 101.479875 13.0
1589 Slovakia SVK 27.200001 90.801178 92.627159 18.0
1424 Portugal PRT 27.700001 238.113003 107.878590 18.0
93 Bahamas BHS 27.900000 10.095760 89.653389 NaN
399 Czechia CZE 27.900000 209.069941 95.394522 22.0
264 Cayman Islands CYM 28.000000 4.156991 81.848961 NaN
499 Estonia EST 28.000000 19.523477 105.432600 21.0
1268 New Zealand NZL 28.200001 146.517541 122.132478 23.0
750 Hungary HUN 28.299999 132.231134 96.114362 19.0
30 Aruba ABW 28.500000 2.390503 101.445366 NaN
141 Belize BLZ 28.600000 1.377177 78.050067 24.0
1133 Malta MLT 28.600000 9.035932 96.253578 17.0
949 Korea KOR 28.900000 1144.066965 96.175832 8.0
1616 Slovenia SVN 29.000000 48.208240 98.524147 18.0
1434 Puerto Rico PRI 29.200001 98.381300 86.932426 NaN
648 Greece GRC 29.299999 297.124962 103.172908 18.0
135 Belgium BEL 29.400000 481.420883 168.041019 22.0
1021 Liechtenstein LIE 29.600000 5.082366 100.130093 NaN
1234 Netherlands NLD 29.799999 847.380859 123.054074 21.0
1724 Switzerland CHE 29.799999 603.434493 94.394710 12.0
291 Chile CHL 29.900000 218.537551 91.442189 21.0
1842 United Kingdom GBR 30.000000 2491.110093 97.737492 24.0
548 Finland FIN 30.200001 249.424311 110.547345 23.0
1075 Luxembourg LUX 30.200001 56.213986 101.570770 20.0
863 Italy ITA 30.299999 2136.099955 101.594011 16.0
151 Bermuda BMU 30.600000 6.634526 82.988594 NaN
568 France FRA 30.700001 2645.187882 106.648727 22.0
1321 Norway NOR 30.799999 428.757038 110.841403 20.0
1660 Spain ESP 30.900000 1422.108200 120.583470 15.0
432 Denmark DNK 31.200001 321.995279 121.352472 23.0
819 Ireland IRL 31.299999 221.876011 118.526894 16.0
110 Barbados BRB 31.900000 4.530000 104.351807 NaN
777 Iceland ISL 32.400002 13.751162 108.613681 21.0
1698 Sweden SWE 32.700001 495.812559 98.319490 21.0
In [969]:
bottom_40_countries
Out[969]:
Entity Code MarriageAge GDP Education IPV
268 Central African Republic CAF 17.299999 2.142591 11.464193 29.0
272 Chad TCD 18.200001 10.668103 13.395092 29.0
216 Burkina Faso BFA 19.500000 10.109619 19.264774 19.0
1100 Malawi MWI 19.700001 6.959656 31.446800 30.0
410 Democratic Republic of Congo COD 20.200001 21.565720 30.457006 47.0
1931 Zimbabwe ZWE 20.600000 12.041655 50.154339 35.0
1570 Sierra Leone SLE 21.000000 2.578026 37.286970 36.0
514 Ethiopia ETH 21.200001 29.933790 31.734772 37.0
1 Afghanistan AFG 21.500000 15.856679 31.832814 46.0
1349 Panama PAN 21.600000 29.440288 72.731840 16.0
1552 Senegal SEN 21.600000 16.121315 36.028431 24.0
928 Kiribati KIR 21.600000 0.156120 91.443108 53.0
1675 Sudan SDN 22.000000 58.962978 38.035416 17.0
229 Cambodia KHM 22.000000 11.242275 41.599239 19.0
223 Burundi BDI 22.100000 2.032135 20.830005 40.0
1679 Suriname SUR 22.799999 4.368398 80.066982 28.0
1365 Peru PER 23.100000 147.528937 92.318985 38.0
911 Kazakhstan KAZ 23.100000 148.047348 100.571630 16.0
1343 Palestine PSE 23.200001 9.681500 89.516943 29.0
1175 Mongolia MNG 23.400000 7.189482 97.961914 27.0
1793 Turkey TUR 23.799999 776.992600 81.419664 32.0
175 Brazil BRA 23.900000 2208.838109 100.510277 23.0
327 Costa Rica CRI 24.200001 37.658615 102.470738 27.0
1470 Russia RUS 24.400000 1524.917468 87.992987 NaN
25 Armenia ARM 24.500000 9.260285 97.644978 10.0
22 Argentina ARG 24.600000 423.627422 103.883884 27.0
835 Israel ISR 24.799999 234.654590 104.193341 6.0
1757 Thailand THA 24.900000 341.104820 85.234282 NaN
1462 Romania ROU 25.600000 166.309355 95.950938 18.0
1331 Oman OMN 25.600000 64.993498 96.810646 NaN
1105 Malaysia MYS 25.700001 255.016609 80.759233 19.0
1559 Serbia SRB 26.000000 41.819469 93.420276 17.0
1397 Poland POL 26.100000 479.834179 95.984723 13.0
508 Eswatini SWZ 26.200001 4.438778 64.199763 18.0
1152 Mexico MEX 26.200001 1057.801296 90.859625 24.0
In [970]:
# education gaps were already dropped, so any remaining NaN is a missing IPV value
print("{} countries have missing IPV value".format(len(top_40_countries) - len(top_40_countries.dropna())))
print("{} countries have missing IPV value".format(len(bottom_40_countries) - len(bottom_40_countries.dropna())))

# cold-deck imputation: fill in values found in an additional source
top_40_countries.loc[top_40_countries['Code'] == 'BRB', 'IPV'] = 30
bottom_40_countries.loc[bottom_40_countries['Code'] == 'RUS', 'IPV'] = 21
bottom_40_countries.loc[bottom_40_countries['Code'] == 'THA', 'IPV'] = 41

# drop countries with missing IPV data
top_40_countries = top_40_countries.dropna()
bottom_40_countries = bottom_40_countries.dropna()
7 countries have missing IPV value
3 countries have missing IPV value

As shown, there are 10 countries without IPV data.

Of the missing-value techniques, we use cold-deck imputation, filling gaps from an additional source: values for Russia, Barbados, and Thailand were found there and inserted. We should not use mean or hot-deck imputation because IPV rates differ drastically between countries due to many other factors. For values that we cannot find in the additional source, we omit the corresponding countries from the analysis. As discussed above, this was another primary reason to increase the sample size.
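
The cold-deck step can be sketched as a lookup against an external table followed by a drop. The frame below is a small illustrative excerpt; the `external_ipv` dictionary holds the three values inserted in this tutorial, and its name is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Small excerpt-shaped table; NaN marks countries missing from the primary IPV file
df = pd.DataFrame({
    "Code": ["BRB", "RUS", "THA", "OMN", "SWE"],
    "IPV": [np.nan, np.nan, np.nan, np.nan, 21.0],
})

# Values taken from a secondary source (the three values used in this tutorial)
external_ipv = {"BRB": 30, "RUS": 21, "THA": 41}

# Fill gaps from the external source only where the primary value is missing
df["IPV"] = df["IPV"].fillna(df["Code"].map(external_ipv))

# Countries still missing after the lookup are dropped
df = df.dropna(subset=["IPV"])
print(df)
```

Because `fillna` only touches missing entries, a country that already has a primary value (Sweden here) is never overwritten by the secondary source.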

3. Data Exploration and Analysis

Relationship between GDP and Marriage Age

In [971]:
# Scatter plot for top 40 countries 
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in top_40_countries.iterrows():
    ax.scatter(row['GDP'], row['MarriageAge'], s=100)
    ax.annotate(row.Entity, xy=(row['GDP'], row['MarriageAge']), xytext=(row['GDP'] - 100, row['MarriageAge'] + 0.1))
plt.title("Relationship between GDP and Marriage Age - Top 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()

# Scatter plot for bottom 40 countries 
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in bottom_40_countries.iterrows():
    ax.scatter(row['GDP'], row['MarriageAge'], s=100)
    ax.annotate(row.Entity, xy=(row['GDP'], row['MarriageAge']), xytext=(row['GDP'] - 100, row['MarriageAge'] + 0.1)) 
plt.title("Relationship between GDP and Marriage Age - Bottom 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()

Observations about the relationship between GDP and marriage age

  • Bottom 40 countries tend to have lower GDP than top 40 countries. For more than half of them, the points form an almost vertical trend.
  • Top 40 countries are spread more widely across the graph. There also appears to be a linear trend between the two variables, although a country such as France has a high GDP but a relatively low average marriage age.
  • The bottom group has some countries that stand out, possibly outliers, such as Russia and Brazil.

Relationship between Female Education and Marriage Age

In [972]:
# Scatter plot for top 40 countries 
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in top_40_countries.iterrows():
    if not math.isnan(row['Education']):
        ax.scatter(row['Education'], row['MarriageAge'], s=100)
        ax.annotate(row.Entity, xy=(row['Education'], row['MarriageAge']), xytext=(row['Education'] - 3.5, row['MarriageAge'] + 0.1))
plt.title("Relationship between Female Education and Marriage Age - Top 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()

# Scatter plot for bottom 40 countries 
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in bottom_40_countries.iterrows():
    if not math.isnan(row['Education']):
        ax.scatter(row['Education'], row['MarriageAge'], s=100)
        ax.annotate(row.Entity, xy=(row['Education'], row['MarriageAge']), xytext=(row['Education'] - 3.5, row['MarriageAge'] + 0.1))
plt.title("Relationship between Female Education and Marriage Age - Bottom 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()

Observations about the relationship between female education and marriage age

  • Top 40 countries tend to have a higher education level than bottom 40 countries.
  • The bottom 40 group shows a strong linear trend for most countries: countries with a higher female education level tend to have a higher marriage age.
  • The top 40 group displays an almost vertical trend (or no trend) between the two variables. Most of these countries seem to enroll children in secondary education. In addition, Belgium appears to be an outlier, with a relatively low marriage age for its education value.

Relationship between Intimate Partner Violence Rate and Marriage Age

In [973]:
fig, ax = plt.subplots(1, 1, figsize=(20,20))
plt.title('IPV for Each Country')
# draw on the axes created above instead of overwriting the figure handle
sns.barplot(y=ipv['Name'], x=ipv['IPV'], ax=ax)

# adding a line to represent mean IPV percentage
ax.axvline(ipv['IPV'].mean(), color="blue", linewidth=2)
plt.xlabel("IPV rate")
plt.ylabel("Country")
plt.show()

Observations about relationship between IPV and marriage age

  • Roughly speaking, the countries at the top of the chart (from Central African Republic to Mexico) correspond to the "Bottom 40 Countries", whereas the countries at the bottom of the chart (from Croatia to Sweden) correspond to the "Top 40 Countries".
  • The blue line indicates the mean IPV value across all countries, which is approximately 23.18.
  • With a few exceptions, the IPV rate of countries in the former group exceeds the average (their bars cross the blue line). Countries in the latter group, on the other hand, tend to have IPV rates below the blue line.
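
The comparison against the blue mean line can also be expressed numerically. The following is a minimal sketch, not part of the original analysis: the values are made up for illustration, the country names are placeholders, and the `group` column is a hypothetical label marking whether a country belongs to the top or bottom group.

```python
import pandas as pd

# Made-up illustrative values, not the notebook's data.
# `group` is a hypothetical column marking top/bottom membership.
ipv_demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "IPV": [40.0, 35.0, 20.0, 10.0, 15.0, 30.0],
    "group": ["bottom", "bottom", "bottom", "top", "top", "top"],
})

# Overall mean, i.e. the position of the vertical line in the bar plot.
overall_mean = ipv_demo["IPV"].mean()

# Share of countries in each group whose IPV rate exceeds the overall mean.
ipv_demo["above_mean"] = ipv_demo["IPV"] > overall_mean
share_above = ipv_demo.groupby("group")["above_mean"].mean()
print(share_above)
```

With the real data, a noticeably larger share for the bottom group would support the visual impression from the bar plot.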

4. Hypothesis Testing¶

Null Hypothesis Testing for 40 Bottom countries¶

The null hypothesis is that there is no correlation between each of the three factors (GDP, Education, and IPV) and marriage age. The alternative hypothesis is that a correlation exists between the factor and marriage age.

In [974]:
from scipy.stats import pearsonr

# GDP
pearsons_r, p_value = pearsonr(bottom_40_countries['GDP'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs GDP: {}, {}".format(pearsons_r, p_value))

# Education
pearsons_r, p_value = pearsonr(bottom_40_countries['Education'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs Education: {}, {}".format(pearsons_r, p_value))

# IPV
pearsons_r, p_value = pearsonr(bottom_40_countries['IPV'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs IPV: {}, {}".format(pearsons_r, p_value))
Pearson's r and p value for bottom 40 countries' Marriage Age vs GDP: 0.36468455396131355, 0.033958557499972765
Pearson's r and p value for bottom 40 countries' Marriage Age vs Education: 0.8049608227420808, 9.526527603805122e-09
Pearson's r and p value for bottom 40 countries' Marriage Age vs IPV: -0.41926374066357386, 0.01358020237875942

Based on the p-values above (approximately 0.034, 9.5e-09, and 0.014 for GDP, Education, and IPV respectively), we can reject the null hypothesis for all three factors because each falls below our 5% cutoff. In other words, for the bottom 40 countries, all three factors are significantly correlated with women's average marriage age: the correlation is strong and positive for Education (r ≈ 0.80), weaker and positive for GDP (r ≈ 0.36), and moderate and negative for IPV (r ≈ -0.42). Based on these findings, we draw linear regression fits below to observe the trends visually.
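
The decision rule used here (reject the null hypothesis when the p-value falls below the 5% cutoff) can be sketched as a small helper. This is a minimal sketch on synthetic data rather than the notebook's data; the function name `correlation_test` and its `alpha` parameter are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr


def correlation_test(x, y, alpha=0.05):
    """Return Pearson's r, the p-value, and whether the p-value is below alpha."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop pairs with a missing value, since pearsonr does not skip NaN.
    mask = ~(np.isnan(x) | np.isnan(y))
    r, p = pearsonr(x[mask], y[mask])
    return float(r), float(p), bool(p < alpha)


# Synthetic example (not the notebook's data): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = 2.0 * x + rng.normal(scale=0.5, size=40)

r, p, reject = correlation_test(x, y)
print("r={:.3f}, p={:.3g}, below cutoff: {}".format(r, p, reject))
```

Calling the helper once per factor, for example with `bottom_40_countries['GDP']` and `bottom_40_countries['MarriageAge']`, would apply the same cutoff consistently to all three tests.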

In [975]:
# Regression plots for the bottom 40 countries (matching the plot titles)
plt.figure(figsize = (5, 5))
sns.regplot(x = 'GDP', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between GDP and Marriage Age - Bottom 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()

plt.figure(figsize = (5, 5))
sns.regplot(x = 'Education', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between Female Education and Marriage Age - Bottom 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()

plt.figure(figsize = (5, 5))
sns.regplot(x = 'IPV', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between IPV and Marriage Age - Bottom 40")
plt.xlabel("IPV")
plt.ylabel("Marriage Age")
plt.show()

The graphs above show the fitted linear trend for each factor. Education has the strongest linear relationship with our dependent variable, average marriage age, while GDP and IPV show weaker but still visible trends.

Predicted Marriage Age for 40 Top countries¶

Now we would like to check whether the regression models fitted on the bottom 40 countries also describe the top 40 countries. To do so, we fit one single-factor linear model per factor on the bottom 40 data, predict marriage age for the top 40 countries, and plot the distributions of actual and predicted marriage age to see whether they share a similar mean and shape.

In [976]:
from sklearn.linear_model import LinearRegression

# Fit one single-factor model per factor on the bottom 40 countries,
# then predict marriage age for the top 40 countries.
for factor in ['GDP', 'Education', 'IPV']:
    lm = LinearRegression()
    lm.fit(bottom_40_countries[[factor]], bottom_40_countries['MarriageAge'])
    predicted = lm.predict(top_40_countries[[factor]])

    f, ax = plt.subplots(figsize=(10,10))
    plt.title('Actual and Predicted Marriage Age Distribution for {} model'.format(factor))
    # kdeplot replaces the deprecated distplot(..., hist=False)
    sns.kdeplot(top_40_countries['MarriageAge'], label="Actual", ax=ax)
    sns.kdeplot(predicted, label="Predictions", ax=ax)
    plt.legend()
    plt.show()

As the graphs above show, the actual and predicted marriage age distributions differ markedly for all three factors. Of the three, the GDP model produces the spread most similar to the actual distribution, but the means differ considerably: about 23 for the predictions versus about 30 for the actual values. For Education and IPV, both the mean and the spread are off: the predicted values cluster narrowly around means of about 24 and 25, whereas the actual distribution spreads more widely around a mean of about 30. We therefore conclude that the linear models fitted on the bottom 40 countries do not fit the top 40 countries well.
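
The visual comparison of means and spreads can be complemented with a few summary numbers. The sketch below is a hedged illustration on synthetic data, not the notebook's data; `compare_predictions` is a hypothetical helper, and the two synthetic groups only mimic the situation where a model fitted on one group is applied to another.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def compare_predictions(model, X, y_actual):
    """Summarise how far a fitted model's predictions are from the actual values."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = model.predict(X)
    return {
        "actual_mean": float(np.mean(y_actual)),
        "predicted_mean": float(np.mean(y_pred)),
        "actual_std": float(np.std(y_actual)),
        "predicted_std": float(np.std(y_pred)),
        "rmse": float(np.sqrt(np.mean((y_actual - y_pred) ** 2))),
    }


# Synthetic stand-ins for the two groups (not the notebook's data).
rng = np.random.default_rng(1)
x_low = rng.uniform(0, 10, size=(40, 1))    # feature for the "bottom" group
y_low = 18 + 0.5 * x_low[:, 0] + rng.normal(scale=0.5, size=40)
x_high = rng.uniform(10, 20, size=(40, 1))  # feature for the "top" group
y_high = 30 + rng.normal(scale=1.5, size=40)  # unrelated to the feature

# Fit on the first group, then evaluate on the second group.
lm_demo = LinearRegression().fit(x_low, y_low)
summary = compare_predictions(lm_demo, x_high, y_high)
for key, value in summary.items():
    print("{}: {:.2f}".format(key, value))
```

A large gap between the two means or a high RMSE indicates, in numbers, the same mismatch that the distribution plots show.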

Null Hypothesis Testing for 40 Top countries¶

Since the linear regression models fitted on the bottom 40 countries do not fit the top 40 data, we would like to check whether a linear relationship exists at all within the top 40 countries. We again evaluate the null hypothesis using p-values.

In [977]:
# GDP
pearsons_r, p_value = pearsonr(top_40_countries['GDP'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs GDP: {}, {}".format(pearsons_r, p_value))

# Education
pearsons_r, p_value = pearsonr(top_40_countries['Education'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs Education: {}, {}".format(pearsons_r, p_value))

# IPV
pearsons_r, p_value = pearsonr(top_40_countries['IPV'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs IPV: {}, {}".format(pearsons_r, p_value))
Pearson's r and p value for top 40 countries' Marriage Age vs GDP: 0.23994458082697234, 0.21875549286589246
Pearson's r and p value for top 40 countries' Marriage Age vs Education: 0.19423195926348696, 0.3219815182448406
Pearson's r and p value for top 40 countries' Marriage Age vs IPV: 0.28131984532033316, 0.1469993294672017

Based on the p-values above (approximately 0.219, 0.322, and 0.147 for GDP, Education, and IPV respectively), we cannot reject the null hypothesis because all of them are above our 5% cutoff. That is, for the top 40 countries we find no statistically significant linear correlation between any of the three factors and women's average marriage age.

5. Insights¶

In conclusion, GDP, education level, and IPV are all significantly correlated with marriage age for the bottom 40 countries. Addressing each of these three factors may therefore help reduce child marriage rates.

If women receive a high level of education and go on to pursue careers, they are more likely to achieve personal goals and may postpone marriage. In addition, if a country's economic level exceeds a certain point, families will be less likely to depend on a marriage dowry. So that women can lead their lives independently, governments should raise awareness of child marriage and make every effort to improve the situation.

As shown above, female education level is associated with marriage age. One way to prevent child marriage is for governments to mandate a required period of education for both boys and girls. If this becomes part of the law and the culture of a country, children will be more likely to complete the required education, and girls will stay in school longer. The proportion of marriages under the age of 18 is then expected to decrease.

In countries with a low marriage age, the rate of intimate partner violence tends to be higher. It would not be appropriate to generalize that the younger one marries, the more domestic violence one experiences, but the lower the age, the more likely the marriage is unwanted, because young brides are often forced into marriage by their parents. Although many countries already prohibit early marriage by law, the practice remains widespread. Likewise, domestic violence may be prohibited by law, but the high IPV rates suggest that young married girls often fall outside the law's protection. One way to improve this is to change perceptions. If more people learn that child marriage is a serious global problem, it may eventually be taken seriously even in countries where child marriage is prevalent.

Note that GDP, education, and IPV do not show a statistically significant relationship with average marriage age for the top 40 countries. Similar to the Happiness Index, this suggests that marriage age is less affected by these factors in countries with higher overall wealth. That is, once a country has sufficient wealth, education, and a low rate of violence, marriage age is possibly driven by other factors. It would be interesting to find out what those factors are.

Another interesting point is that global organizations have recently reported that child marriage has increased in countries that were hit relatively hard by COVID-19. Since not enough public data is available yet, it is not easy to analyze the relationship between average marriage age and COVID-19 mortality or Intensive Care Unit (ICU) rates. When such data becomes available, it would be worthwhile to do similar research on this topic.