According to UNICEF, child marriage is any formal or informal union between a child under the age of 18 and an adult or another child (https://www.unicef.org/protection/child-marriage), and it mostly affects girls. Child marriage is a global problem, like poverty, human trafficking, and environmental pollution. Compared with those issues, however, many people are unaware of child marriage and why it matters. Among its many causes, one of the root causes is poverty. Families in financial difficulty often force their young daughters to marry in order to receive a payment (often described as a dowry or bride price) from the groom or his family. Child marriage under such parental pressure often leads to sexual and emotional abuse of the child. In addition, children who marry early are deprived of education and work opportunities. Because adolescents are not yet physically and mentally mature, they may lose their lives or suffer lasting physical or psychological harm from childbirth complications if they become pregnant during this period. To learn more about the problems caused by child marriage, visit the following sites.
Even at this very moment, many girls are suffering as victims of early marriage. The problem cannot be solved overnight, but it is important to raise awareness so that readers of this tutorial, like you, can help by sharing the topic with others. In this tutorial, we will look at which countries have high child marriage rates and what they have in common. Based on the processed data, we will discuss what solutions could be proposed to decrease the child marriage rate. Specifically, we will use each country's average female marriage age to find out which countries tend to marry at an early age. In addition, we will analyze how factors such as a country's Gross Domestic Product (GDP), education level, and rate of Intimate Partner Violence (IPV) relate to average female marriage age.
The first dataset is the average age at marriage for women in each country, spanning roughly 50 years from 1970 to 2018: (https://ourworldindata.org/grapher/age-at-marriage-women). The following code keeps only the years after 2002 because we want to focus on recent data. However, the number of countries with data differs from year to year. For instance, 2018 has data from only 8 countries, whereas 2011 has data from 83. Therefore, we should select a single standard year after 2002. To do so, we will count the observations available for each year and select the year with the highest count.
import math
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression
import warnings
# ignore warnings
warnings.filterwarnings('ignore')
# display all rows for tables
pd.set_option("display.max_rows", None, "display.max_columns", None)
child_marriage = pd.read_csv("marriageage.csv")
# filter recent data after 2002
child_marriage_post_2002 = child_marriage[child_marriage.Year > 2002]
count_by_year = child_marriage_post_2002['Year'].value_counts()
# organize grouped data into dataframe with proper column names
df_with_year_count_columns = pd.DataFrame({'Year':count_by_year.index, 'Count':count_by_year.values})
child_marriage_2002 = df_with_year_count_columns.sort_values(by='Year')
# Create a plot with most data since 2002
plt.figure(figsize = (10, 10))
plt.bar(child_marriage_2002['Year'], child_marriage_2002['Count'])
# label the count values for each year
xlabels = []
for index, row in child_marriage_2002.iterrows():
    plt.text(row.Year, row.Count, row.Count, ha = 'center', bbox = dict(facecolor = 'red', alpha =0.8))
    xlabels.append(row.Year)
# reuse the axes that already hold the bars
ax = plt.gca()
ax.set_xticks(child_marriage_2002['Year'])
ax.set_xticklabels(xlabels)
plt.title("Number of data for each year")
plt.xlabel("Year")
plt.ylabel("Count")
child_marriage_2010 = child_marriage[child_marriage.Year == 2010]
child_marriage_2010 = child_marriage_2010.rename(columns={'Estimated average age at marriage, women': 'MarriageAge'})
child_marriage_2010 = child_marriage_2010.sort_values(by=['MarriageAge'])
child_marriage_2010 = child_marriage_2010.drop(columns=['Year'])
Since 2010 has the highest frequency, with 85 countries, we will use 2010 as the standard year. That is, subsequent data processing for GDP, Education, and IPV should be done for the same year.
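The standard year can also be picked programmatically rather than read off the bar chart. The following is a minimal sketch, assuming a frame with the same `Year` column; the frame `toy` and its rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per (country, year) observation.
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "A", "B", "A"],
    "Year":   [2010, 2010, 2010, 2011, 2011, 2018],
})

# Count observations per year, then take the year with the largest count.
count_by_year = toy["Year"].value_counts()
standard_year = count_by_year.idxmax()

print(standard_year, count_by_year.max())
```

If two years tie for the highest count, `idxmax` returns only one of them, so a tie would still need a manual decision.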
Instead of treating all 85 countries as a single group, we separate them into two groups. This rests on the assumption that factors such as GDP, Education, and IPV may affect the average marriage age differently in each group. Similar research has examined the relationship between GDP and the Happiness Index: beyond a certain GDP level, GDP appears to have little impact on the Happiness Index. Therefore, we would like to see how these factors relate to marriage age within each group.
Since there are 85 countries, we take 40 from each end of the ranking, which leaves the 5 countries in the middle out of both groups. "Top 40 Countries" are the countries with the highest marriage ages, whereas "Bottom 40 Countries" are the countries with the lowest.
# copy so that columns added later do not write into a view of child_marriage_2010
top_40_countries = child_marriage_2010.tail(40).copy()
top_40_countries
Index | Entity | Code | MarriageAge |
---|---|---|---|
350 | Croatia | HRV | 27.100000 |
1589 | Slovakia | SVK | 27.200001 |
1424 | Portugal | PRT | 27.700001 |
93 | Bahamas | BHS | 27.900000 |
1577 | Singapore | SGP | 27.900000 |
399 | Czechia | CZE | 27.900000 |
51 | Australia | AUS | 27.900000 |
264 | Cayman Islands | CYM | 28.000000 |
499 | Estonia | EST | 28.000000 |
1303 | Northern Mariana Islands | MNP | 28.100000 |
1268 | New Zealand | NZL | 28.200001 |
750 | Hungary | HUN | 28.299999 |
30 | Aruba | ABW | 28.500000 |
141 | Belize | BLZ | 28.600000 |
1133 | Malta | MLT | 28.600000 |
892 | Japan | JPN | 28.799999 |
949 | Korea | KOR | 28.900000 |
1616 | Slovenia | SVN | 29.000000 |
1434 | Puerto Rico | PRI | 29.200001 |
648 | Greece | GRC | 29.299999 |
135 | Belgium | BEL | 29.400000 |
1021 | Liechtenstein | LIE | 29.600000 |
1234 | Netherlands | NLD | 29.799999 |
1724 | Switzerland | CHE | 29.799999 |
291 | Chile | CHL | 29.900000 |
1842 | United Kingdom | GBR | 30.000000 |
548 | Finland | FIN | 30.200001 |
1075 | Luxembourg | LUX | 30.200001 |
863 | Italy | ITA | 30.299999 |
151 | Bermuda | BMU | 30.600000 |
568 | France | FRA | 30.700001 |
1321 | Norway | NOR | 30.799999 |
1660 | Spain | ESP | 30.900000 |
1877 | United States Virgin Islands | VIR | 30.900000 |
678 | Greenland | GRL | 31.200001 |
432 | Denmark | DNK | 31.200001 |
819 | Ireland | IRL | 31.299999 |
110 | Barbados | BRB | 31.900000 |
777 | Iceland | ISL | 32.400002 |
1698 | Sweden | SWE | 32.700001 |
# copy so that columns added later do not write into a view of child_marriage_2010
bottom_40_countries = child_marriage_2010.head(40).copy()
bottom_40_countries
Index | Entity | Code | MarriageAge |
---|---|---|---|
268 | Central African Republic | CAF | 17.299999 |
272 | Chad | TCD | 18.200001 |
216 | Burkina Faso | BFA | 19.500000 |
1100 | Malawi | MWI | 19.700001 |
410 | Democratic Republic of Congo | COD | 20.200001 |
1931 | Zimbabwe | ZWE | 20.600000 |
1570 | Sierra Leone | SLE | 21.000000 |
514 | Ethiopia | ETH | 21.200001 |
1 | Afghanistan | AFG | 21.500000 |
1349 | Panama | PAN | 21.600000 |
1552 | Senegal | SEN | 21.600000 |
928 | Kiribati | KIR | 21.600000 |
703 | Guinea-Bissau | GNB | 21.700001 |
1675 | Sudan | SDN | 22.000000 |
229 | Cambodia | KHM | 22.000000 |
223 | Burundi | BDI | 22.100000 |
1902 | Vietnam | VNM | 22.299999 |
1764 | Togo | TGO | 22.400000 |
1679 | Suriname | SUR | 22.799999 |
1365 | Peru | PER | 23.100000 |
911 | Kazakhstan | KAZ | 23.100000 |
1343 | Palestine | PSE | 23.200001 |
1175 | Mongolia | MNG | 23.400000 |
1793 | Turkey | TUR | 23.799999 |
175 | Brazil | BRA | 23.900000 |
327 | Costa Rica | CRI | 24.200001 |
1470 | Russia | RUS | 24.400000 |
25 | Armenia | ARM | 24.500000 |
22 | Argentina | ARG | 24.600000 |
835 | Israel | ISR | 24.799999 |
1757 | Thailand | THA | 24.900000 |
1159 | Micronesia (country) | FSM | 25.299999 |
1462 | Romania | ROU | 25.600000 |
1331 | Oman | OMN | 25.600000 |
1105 | Malaysia | MYS | 25.700001 |
1559 | Serbia | SRB | 26.000000 |
1868 | United States | USA | 26.100000 |
1397 | Poland | POL | 26.100000 |
508 | Eswatini | SWZ | 26.200001 |
1152 | Mexico | MEX | 26.200001 |
print("Average Marriage Age's mean and standard deviation for top 40 countries: {}, {}".format(top_40_countries.MarriageAge.mean(), top_40_countries.MarriageAge.std()))
print("Average Marriage Age's mean and standard deviation for bottom 40 countries: {}, {}".format(bottom_40_countries.MarriageAge.mean(), bottom_40_countries.MarriageAge.std()))
Average Marriage Age's mean and standard deviation for top 40 countries: 29.4725001, 1.4487816634005557
Average Marriage Age's mean and standard deviation for bottom 40 countries: 23.0000001, 2.290224467780595
The highest average marriage age is 32.7, in Sweden, and the lowest is 17.3, in the Central African Republic. The difference between the maximum and minimum marriage age is 15.4 years.
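The grouping and the range quoted above can be reproduced on a small example. This is only a sketch under stated assumptions: the frame `toy`, its five rows, and `group_size` are hypothetical, and it simply shows how `head` and `tail` on a sorted frame produce the two groups and which rows fall outside both.

```python
import pandas as pd

# Hypothetical toy data with five countries; values are illustrative only.
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "E"],
    "MarriageAge": [30.1, 18.4, 25.0, 27.3, 21.9],
}).sort_values(by="MarriageAge")

group_size = 2
bottom = toy.head(group_size).copy()   # lowest marriage ages
top = toy.tail(group_size).copy()      # highest marriage ages

# Rows that belong to neither group (the middle of the ranking).
middle = toy.drop(bottom.index).drop(top.index)

# Gap between the highest and lowest average marriage age.
age_range = toy["MarriageAge"].max() - toy["MarriageAge"].min()

print(list(bottom["Entity"]), list(top["Entity"]), list(middle["Entity"]))
print(round(age_range, 1))
```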
GDP is widely regarded as one of the key indicators of a country's economic strength. As discussed above, a major reason for child marriage is the family's financial situation. On the assumption that families in countries with higher GDP are under less financial pressure to marry off their daughters for a payment, we would like to consider GDP as a possible contributing factor of child marriage.
The data comes from the following source: (https://data.worldbank.org/indicator/NY.GDP.MKTP.CD). First, we will extract the 2010 GDP value of each of the top 40 and bottom 40 countries based on their country codes. Then, we will add these values to the existing dataframes as a GDP column, expressed in billions of US dollars.
gdp = pd.read_csv("gdp.csv", on_bad_lines='skip')
gdp_2010 = gdp[['Country Code', '2010']]
for index, row in top_40_countries.iterrows():
    # filter 2010 gdp data for the matching country code
    gdp_data = gdp_2010.loc[gdp_2010['Country Code'] == row.Code, '2010']
    # divided by billion for better readability
    top_40_countries.at[index, 'GDP'] = gdp_data.values[0] / 1000000000
for index, row in bottom_40_countries.iterrows():
    gdp_data = gdp_2010.loc[gdp_2010['Country Code'] == row.Code, '2010']
    bottom_40_countries.at[index, 'GDP'] = gdp_data.values[0] / 1000000000
top_40_countries
Index | Entity | Code | MarriageAge | GDP |
---|---|---|---|---|
350 | Croatia | HRV | 27.100000 | 60.426019 |
1589 | Slovakia | SVK | 27.200001 | 90.801178 |
1424 | Portugal | PRT | 27.700001 | 238.113003 |
93 | Bahamas | BHS | 27.900000 | 10.095760 |
1577 | Singapore | SGP | 27.900000 | 239.809388 |
399 | Czechia | CZE | 27.900000 | 209.069941 |
51 | Australia | AUS | 27.900000 | 1147.589183 |
264 | Cayman Islands | CYM | 28.000000 | 4.156991 |
499 | Estonia | EST | 28.000000 | 19.523477 |
1303 | Northern Mariana Islands | MNP | 28.100000 | 0.799000 |
1268 | New Zealand | NZL | 28.200001 | 146.517541 |
750 | Hungary | HUN | 28.299999 | 132.231134 |
30 | Aruba | ABW | 28.500000 | 2.390503 |
141 | Belize | BLZ | 28.600000 | 1.377177 |
1133 | Malta | MLT | 28.600000 | 9.035932 |
892 | Japan | JPN | 28.799999 | 5759.071769 |
949 | Korea | KOR | 28.900000 | 1144.066965 |
1616 | Slovenia | SVN | 29.000000 | 48.208240 |
1434 | Puerto Rico | PRI | 29.200001 | 98.381300 |
648 | Greece | GRC | 29.299999 | 297.124962 |
135 | Belgium | BEL | 29.400000 | 481.420883 |
1021 | Liechtenstein | LIE | 29.600000 | 5.082366 |
1234 | Netherlands | NLD | 29.799999 | 847.380859 |
1724 | Switzerland | CHE | 29.799999 | 603.434493 |
291 | Chile | CHL | 29.900000 | 218.537551 |
1842 | United Kingdom | GBR | 30.000000 | 2491.110093 |
548 | Finland | FIN | 30.200001 | 249.424311 |
1075 | Luxembourg | LUX | 30.200001 | 56.213986 |
863 | Italy | ITA | 30.299999 | 2136.099955 |
151 | Bermuda | BMU | 30.600000 | 6.634526 |
568 | France | FRA | 30.700001 | 2645.187882 |
1321 | Norway | NOR | 30.799999 | 428.757038 |
1660 | Spain | ESP | 30.900000 | 1422.108200 |
1877 | United States Virgin Islands | VIR | 30.900000 | 4.324000 |
678 | Greenland | GRL | 31.200001 | 2.503156 |
432 | Denmark | DNK | 31.200001 | 321.995279 |
819 | Ireland | IRL | 31.299999 | 221.876011 |
110 | Barbados | BRB | 31.900000 | 4.530000 |
777 | Iceland | ISL | 32.400002 | 13.751162 |
1698 | Sweden | SWE | 32.700001 | 495.812559 |
bottom_40_countries
Index | Entity | Code | MarriageAge | GDP |
---|---|---|---|---|
268 | Central African Republic | CAF | 17.299999 | 2.142591 |
272 | Chad | TCD | 18.200001 | 10.668103 |
216 | Burkina Faso | BFA | 19.500000 | 10.109619 |
1100 | Malawi | MWI | 19.700001 | 6.959656 |
410 | Democratic Republic of Congo | COD | 20.200001 | 21.565720 |
1931 | Zimbabwe | ZWE | 20.600000 | 12.041655 |
1570 | Sierra Leone | SLE | 21.000000 | 2.578026 |
514 | Ethiopia | ETH | 21.200001 | 29.933790 |
1 | Afghanistan | AFG | 21.500000 | 15.856679 |
1349 | Panama | PAN | 21.600000 | 29.440288 |
1552 | Senegal | SEN | 21.600000 | 16.121315 |
928 | Kiribati | KIR | 21.600000 | 0.156120 |
703 | Guinea-Bissau | GNB | 21.700001 | 0.849878 |
1675 | Sudan | SDN | 22.000000 | 58.962978 |
229 | Cambodia | KHM | 22.000000 | 11.242275 |
223 | Burundi | BDI | 22.100000 | 2.032135 |
1902 | Vietnam | VNM | 22.299999 | 115.931750 |
1764 | Togo | TGO | 22.400000 | 3.429461 |
1679 | Suriname | SUR | 22.799999 | 4.368398 |
1365 | Peru | PER | 23.100000 | 147.528937 |
911 | Kazakhstan | KAZ | 23.100000 | 148.047348 |
1343 | Palestine | PSE | 23.200001 | 9.681500 |
1175 | Mongolia | MNG | 23.400000 | 7.189482 |
1793 | Turkey | TUR | 23.799999 | 776.992600 |
175 | Brazil | BRA | 23.900000 | 2208.838109 |
327 | Costa Rica | CRI | 24.200001 | 37.658615 |
1470 | Russia | RUS | 24.400000 | 1524.917468 |
25 | Armenia | ARM | 24.500000 | 9.260285 |
22 | Argentina | ARG | 24.600000 | 423.627422 |
835 | Israel | ISR | 24.799999 | 234.654590 |
1757 | Thailand | THA | 24.900000 | 341.104820 |
1159 | Micronesia (country) | FSM | 25.299999 | 0.296944 |
1462 | Romania | ROU | 25.600000 | 166.309355 |
1331 | Oman | OMN | 25.600000 | 64.993498 |
1105 | Malaysia | MYS | 25.700001 | 255.016609 |
1559 | Serbia | SRB | 26.000000 | 41.819469 |
1868 | United States | USA | 26.100000 | 14992.052727 |
1397 | Poland | POL | 26.100000 | 479.834179 |
508 | Eswatini | SWZ | 26.200001 | 4.438778 |
1152 | Mexico | MEX | 26.200001 | 1057.801296 |
print("GDP's mean and standard deviation for top 40 countries: {}, {}".format(top_40_countries.GDP.mean(), top_40_countries.GDP.std()))
print("GDP's mean and standard deviation for bottom 40 countries: {}, {}".format(bottom_40_countries.GDP.mean(), bottom_40_countries.GDP.std()))
GDP's mean and standard deviation for top 40 countries: 557.8743444109135, 1079.5461180363852
GDP's mean and standard deviation for bottom 40 countries: 582.1613617224921, 2379.8165378475837
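The per-country GDP lookup can also be written as a single join instead of a loop. The following is a minimal sketch, assuming small made-up tables that reuse the column names `Code`, `Country Code`, and `2010`; the frames `countries` and `gdp_2010` here are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the marriage-age and GDP tables.
countries = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Code": ["AAA", "BBB", "CCC"],
    "MarriageAge": [22.0, 27.5, 30.2],
})
gdp_2010 = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "2010": [5.0e9, 2.5e11],
})

# Left-join on the country code so every country is kept,
# then express GDP in billions of US dollars.
merged = countries.merge(
    gdp_2010, how="left", left_on="Code", right_on="Country Code"
)
merged["GDP"] = merged["2010"] / 1e9
merged = merged.drop(columns=["Country Code", "2010"])

print(merged)
```

One difference in behavior is worth noting: with a left join, a country code that has no GDP row ends up with NaN, whereas indexing an empty lookup result with `values[0]` raises an error.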
Another probable indicator is the level of female education. In most countries, children who complete compulsory education before marrying are much less likely to marry at a young age. For example, in the United States, most states require children to attend school from around age 6 until their late teens, with the exact ages varying by state.
The following dataset is from UNESCO: (http://data.uis.unesco.org). Each value is the gross enrollment ratio of girls in secondary education, expressed as a percentage. Note that this ratio can exceed 100% because it counts over-aged and under-aged students, who result from early or late entry and grade repetition.
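A short worked example shows how a gross ratio can exceed 100%. It is a sketch with made-up numbers: the function name `gross_enrollment_ratio` and the figures are hypothetical, and the formula is the usual gross definition (everyone enrolled, regardless of age, divided by the population of official school age).

```python
def gross_enrollment_ratio(enrolled_all_ages, official_age_population):
    """Gross enrollment ratio in percent."""
    return 100.0 * enrolled_all_ages / official_age_population

# Hypothetical numbers: 1,000 girls of official secondary-school age,
# 950 of them enrolled, plus 120 enrolled students outside that age range
# (early or late entrants and grade repeaters).
ratio = gross_enrollment_ratio(950 + 120, 1000)
print(ratio)
```

The numerator counts the 120 students outside the official age range while the denominator does not, which is why the result goes above 100.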
Since some of the data for 2010 is missing, we will instead calculate the average of the available values from 2008 to 2012.
years = ['2008', '2009', '2010', '2011', '2012']
columns = np.append(['Country Code'], years)
education = pd.read_csv("female_education.csv", on_bad_lines='skip')
education_2008_2012 = education[columns]
for index, row in top_40_countries.iterrows():
    values = []
    for year in years:
        # the lookup returns a Series; it is empty when the country code is absent
        value = education_2008_2012.loc[education_2008_2012['Country Code'] == row.Code, year]
        # use only existing data
        if len(value) > 0 and not math.isnan(value.values[0]):
            values.append(value.values[0])
    if len(values) == 0:
        top_40_countries.loc[index, 'Education'] = np.nan
    else:
        top_40_countries.loc[index, 'Education'] = np.sum(values)/len(values)
for index, row in bottom_40_countries.iterrows():
    values = []
    for year in years:
        value = education_2008_2012.loc[education_2008_2012['Country Code'] == row.Code, year]
        if len(value) > 0 and not math.isnan(value.values[0]):
            values.append(value.values[0])
    if len(values) == 0:
        bottom_40_countries.loc[index, 'Education'] = np.nan
    else:
        bottom_40_countries.loc[index, 'Education'] = np.sum(values)/len(values)
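The averaging rule used in the loops above (average whatever years exist, NaN if none do) can also be expressed as a row-wise mean. This is a minimal sketch on a hypothetical table `toy` in the same wide layout, with one column per year; the values are made up.

```python
import numpy as np
import pandas as pd

years = ["2008", "2009", "2010", "2011", "2012"]

# Hypothetical enrollment table in the same wide layout (one column per year).
toy = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB", "CCC"],
        "2008": [90.0, np.nan, np.nan],
        "2009": [92.0, np.nan, np.nan],
        "2010": [np.nan, 40.0, np.nan],
        "2011": [94.0, np.nan, np.nan],
        "2012": [np.nan, 44.0, np.nan],
    }
)

# mean(axis=1) skips NaN by default, so each country is averaged over
# whichever years it actually reports; all-NaN rows stay NaN.
toy["Education"] = toy[years].mean(axis=1)

print(toy[["Country Code", "Education"]])
```

The resulting column could then be joined onto the country tables by country code.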
top_40_countries
Index | Entity | Code | MarriageAge | GDP | Education |
---|---|---|---|---|---|
350 | Croatia | HRV | 27.100000 | 60.426019 | 101.479875 |
1589 | Slovakia | SVK | 27.200001 | 90.801178 | 92.627159 |
1424 | Portugal | PRT | 27.700001 | 238.113003 | 107.878590 |
93 | Bahamas | BHS | 27.900000 | 10.095760 | 89.653389 |
1577 | Singapore | SGP | 27.900000 | 239.809388 | NaN |
399 | Czechia | CZE | 27.900000 | 209.069941 | 95.394522 |
51 | Australia | AUS | 27.900000 | 1147.589183 | NaN |
264 | Cayman Islands | CYM | 28.000000 | 4.156991 | 81.848961 |
499 | Estonia | EST | 28.000000 | 19.523477 | 105.432600 |
1303 | Northern Mariana Islands | MNP | 28.100000 | 0.799000 | NaN |
1268 | New Zealand | NZL | 28.200001 | 146.517541 | 122.132478 |
750 | Hungary | HUN | 28.299999 | 132.231134 | 96.114362 |
30 | Aruba | ABW | 28.500000 | 2.390503 | 101.445366 |
141 | Belize | BLZ | 28.600000 | 1.377177 | 78.050067 |
1133 | Malta | MLT | 28.600000 | 9.035932 | 96.253578 |
892 | Japan | JPN | 28.799999 | 5759.071769 | NaN |
949 | Korea | KOR | 28.900000 | 1144.066965 | 96.175832 |
1616 | Slovenia | SVN | 29.000000 | 48.208240 | 98.524147 |
1434 | Puerto Rico | PRI | 29.200001 | 98.381300 | 86.932426 |
648 | Greece | GRC | 29.299999 | 297.124962 | 103.172908 |
135 | Belgium | BEL | 29.400000 | 481.420883 | 168.041019 |
1021 | Liechtenstein | LIE | 29.600000 | 5.082366 | 100.130093 |
1234 | Netherlands | NLD | 29.799999 | 847.380859 | 123.054074 |
1724 | Switzerland | CHE | 29.799999 | 603.434493 | 94.394710 |
291 | Chile | CHL | 29.900000 | 218.537551 | 91.442189 |
1842 | United Kingdom | GBR | 30.000000 | 2491.110093 | 97.737492 |
548 | Finland | FIN | 30.200001 | 249.424311 | 110.547345 |
1075 | Luxembourg | LUX | 30.200001 | 56.213986 | 101.570770 |
863 | Italy | ITA | 30.299999 | 2136.099955 | 101.594011 |
151 | Bermuda | BMU | 30.600000 | 6.634526 | 82.988594 |
568 | France | FRA | 30.700001 | 2645.187882 | 106.648727 |
1321 | Norway | NOR | 30.799999 | 428.757038 | 110.841403 |
1660 | Spain | ESP | 30.900000 | 1422.108200 | 120.583470 |
1877 | United States Virgin Islands | VIR | 30.900000 | 4.324000 | NaN |
678 | Greenland | GRL | 31.200001 | 2.503156 | NaN |
432 | Denmark | DNK | 31.200001 | 321.995279 | 121.352472 |
819 | Ireland | IRL | 31.299999 | 221.876011 | 118.526894 |
110 | Barbados | BRB | 31.900000 | 4.530000 | 104.351807 |
777 | Iceland | ISL | 32.400002 | 13.751162 | 108.613681 |
1698 | Sweden | SWE | 32.700001 | 495.812559 | 98.319490 |
bottom_40_countries
Index | Entity | Code | MarriageAge | GDP | Education |
---|---|---|---|---|---|
268 | Central African Republic | CAF | 17.299999 | 2.142591 | 11.464193 |
272 | Chad | TCD | 18.200001 | 10.668103 | 13.395092 |
216 | Burkina Faso | BFA | 19.500000 | 10.109619 | 19.264774 |
1100 | Malawi | MWI | 19.700001 | 6.959656 | 31.446800 |
410 | Democratic Republic of Congo | COD | 20.200001 | 21.565720 | 30.457006 |
1931 | Zimbabwe | ZWE | 20.600000 | 12.041655 | 50.154339 |
1570 | Sierra Leone | SLE | 21.000000 | 2.578026 | 37.286970 |
514 | Ethiopia | ETH | 21.200001 | 29.933790 | 31.734772 |
1 | Afghanistan | AFG | 21.500000 | 15.856679 | 31.832814 |
1349 | Panama | PAN | 21.600000 | 29.440288 | 72.731840 |
1552 | Senegal | SEN | 21.600000 | 16.121315 | 36.028431 |
928 | Kiribati | KIR | 21.600000 | 0.156120 | 91.443108 |
703 | Guinea-Bissau | GNB | 21.700001 | 0.849878 | NaN |
1675 | Sudan | SDN | 22.000000 | 58.962978 | 38.035416 |
229 | Cambodia | KHM | 22.000000 | 11.242275 | 41.599239 |
223 | Burundi | BDI | 22.100000 | 2.032135 | 20.830005 |
1902 | Vietnam | VNM | 22.299999 | 115.931750 | NaN |
1764 | Togo | TGO | 22.400000 | 3.429461 | NaN |
1679 | Suriname | SUR | 22.799999 | 4.368398 | 80.066982 |
1365 | Peru | PER | 23.100000 | 147.528937 | 92.318985 |
911 | Kazakhstan | KAZ | 23.100000 | 148.047348 | 100.571630 |
1343 | Palestine | PSE | 23.200001 | 9.681500 | 89.516943 |
1175 | Mongolia | MNG | 23.400000 | 7.189482 | 97.961914 |
1793 | Turkey | TUR | 23.799999 | 776.992600 | 81.419664 |
175 | Brazil | BRA | 23.900000 | 2208.838109 | 100.510277 |
327 | Costa Rica | CRI | 24.200001 | 37.658615 | 102.470738 |
1470 | Russia | RUS | 24.400000 | 1524.917468 | 87.992987 |
25 | Armenia | ARM | 24.500000 | 9.260285 | 97.644978 |
22 | Argentina | ARG | 24.600000 | 423.627422 | 103.883884 |
835 | Israel | ISR | 24.799999 | 234.654590 | 104.193341 |
1757 | Thailand | THA | 24.900000 | 341.104820 | 85.234282 |
1159 | Micronesia (country) | FSM | 25.299999 | 0.296944 | NaN |
1462 | Romania | ROU | 25.600000 | 166.309355 | 95.950938 |
1331 | Oman | OMN | 25.600000 | 64.993498 | 96.810646 |
1105 | Malaysia | MYS | 25.700001 | 255.016609 | 80.759233 |
1559 | Serbia | SRB | 26.000000 | 41.819469 | 93.420276 |
1868 | United States | USA | 26.100000 | 14992.052727 | NaN |
1397 | Poland | POL | 26.100000 | 479.834179 | 95.984723 |
508 | Eswatini | SWZ | 26.200001 | 4.438778 | 64.199763 |
1152 | Mexico | MEX | 26.200001 | 1057.801296 | 90.859625 |
print("{} countries have missing female education value".format(len(top_40_countries) - len(top_40_countries.dropna())))
print("{} countries have missing female education value".format(len(bottom_40_countries) - len(bottom_40_countries.dropna())))
# drop countries with missing education data
top_40_countries = top_40_countries.dropna()
bottom_40_countries = bottom_40_countries.dropna()
6 countries have missing female education value
5 countries have missing female education value
As shown, there are 11 countries without female education data. That is, none of them has any education data from 2008 to 2012.
Missing-value techniques such as mean, hot-deck, and cold-deck imputation do not suit our situation, mostly because one country's education level cannot easily be inferred from another's. For instance, using a global average for a missing country such as Vietnam would likely skew the result. Therefore, the best approach is to drop such countries from the analysis and increase the sample size instead. Note that this tutorial originally analyzed 40 countries, but the sample size was increased to 80 to account for missing data.
Based on reports from global organizations such as the United Nations International Children's Emergency Fund (UNICEF) and the World Health Organization (WHO), young women who are forced to marry older husbands are likely to experience domestic violence. Based on these findings, we would like to examine the relationship between IPV and average female marriage age.
The following data from WHO gives the percentage of women who have experienced intimate partner violence, by country: (https://srhr.org/vaw-data/data). Unfortunately, the page is generated dynamically and was not easy to crawl, so I collected the data manually, entering each country's value one at a time, and saved it as a CSV file. If you would like to learn how to crawl such pages, here is a helpful link: (https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dynamic_websites.htm).
ipv = pd.read_csv("ipv.csv", on_bad_lines='skip')
for index, row in top_40_countries.iterrows():
    ipv_value = ipv.loc[ipv['Country Code'] == row.Code, 'IPV']
    # only populate non missing data
    if len(ipv_value.values) != 0:
        top_40_countries.at[index, 'IPV'] = ipv_value.values[0]
for index, row in bottom_40_countries.iterrows():
    ipv_value = ipv.loc[ipv['Country Code'] == row.Code, 'IPV']
    if len(ipv_value.values) != 0:
        bottom_40_countries.at[index, 'IPV'] = ipv_value.values[0]
top_40_countries
Index | Entity | Code | MarriageAge | GDP | Education | IPV |
---|---|---|---|---|---|---|
350 | Croatia | HRV | 27.100000 | 60.426019 | 101.479875 | 13.0 |
1589 | Slovakia | SVK | 27.200001 | 90.801178 | 92.627159 | 18.0 |
1424 | Portugal | PRT | 27.700001 | 238.113003 | 107.878590 | 18.0 |
93 | Bahamas | BHS | 27.900000 | 10.095760 | 89.653389 | NaN |
399 | Czechia | CZE | 27.900000 | 209.069941 | 95.394522 | 22.0 |
264 | Cayman Islands | CYM | 28.000000 | 4.156991 | 81.848961 | NaN |
499 | Estonia | EST | 28.000000 | 19.523477 | 105.432600 | 21.0 |
1268 | New Zealand | NZL | 28.200001 | 146.517541 | 122.132478 | 23.0 |
750 | Hungary | HUN | 28.299999 | 132.231134 | 96.114362 | 19.0 |
30 | Aruba | ABW | 28.500000 | 2.390503 | 101.445366 | NaN |
141 | Belize | BLZ | 28.600000 | 1.377177 | 78.050067 | 24.0 |
1133 | Malta | MLT | 28.600000 | 9.035932 | 96.253578 | 17.0 |
949 | Korea | KOR | 28.900000 | 1144.066965 | 96.175832 | 8.0 |
1616 | Slovenia | SVN | 29.000000 | 48.208240 | 98.524147 | 18.0 |
1434 | Puerto Rico | PRI | 29.200001 | 98.381300 | 86.932426 | NaN |
648 | Greece | GRC | 29.299999 | 297.124962 | 103.172908 | 18.0 |
135 | Belgium | BEL | 29.400000 | 481.420883 | 168.041019 | 22.0 |
1021 | Liechtenstein | LIE | 29.600000 | 5.082366 | 100.130093 | NaN |
1234 | Netherlands | NLD | 29.799999 | 847.380859 | 123.054074 | 21.0 |
1724 | Switzerland | CHE | 29.799999 | 603.434493 | 94.394710 | 12.0 |
291 | Chile | CHL | 29.900000 | 218.537551 | 91.442189 | 21.0 |
1842 | United Kingdom | GBR | 30.000000 | 2491.110093 | 97.737492 | 24.0 |
548 | Finland | FIN | 30.200001 | 249.424311 | 110.547345 | 23.0 |
1075 | Luxembourg | LUX | 30.200001 | 56.213986 | 101.570770 | 20.0 |
863 | Italy | ITA | 30.299999 | 2136.099955 | 101.594011 | 16.0 |
151 | Bermuda | BMU | 30.600000 | 6.634526 | 82.988594 | NaN |
568 | France | FRA | 30.700001 | 2645.187882 | 106.648727 | 22.0 |
1321 | Norway | NOR | 30.799999 | 428.757038 | 110.841403 | 20.0 |
1660 | Spain | ESP | 30.900000 | 1422.108200 | 120.583470 | 15.0 |
432 | Denmark | DNK | 31.200001 | 321.995279 | 121.352472 | 23.0 |
819 | Ireland | IRL | 31.299999 | 221.876011 | 118.526894 | 16.0 |
110 | Barbados | BRB | 31.900000 | 4.530000 | 104.351807 | NaN |
777 | Iceland | ISL | 32.400002 | 13.751162 | 108.613681 | 21.0 |
1698 | Sweden | SWE | 32.700001 | 495.812559 | 98.319490 | 21.0 |
bottom_40_countries
Index | Entity | Code | MarriageAge | GDP | Education | IPV |
---|---|---|---|---|---|---|
268 | Central African Republic | CAF | 17.299999 | 2.142591 | 11.464193 | 29.0 |
272 | Chad | TCD | 18.200001 | 10.668103 | 13.395092 | 29.0 |
216 | Burkina Faso | BFA | 19.500000 | 10.109619 | 19.264774 | 19.0 |
1100 | Malawi | MWI | 19.700001 | 6.959656 | 31.446800 | 30.0 |
410 | Democratic Republic of Congo | COD | 20.200001 | 21.565720 | 30.457006 | 47.0 |
1931 | Zimbabwe | ZWE | 20.600000 | 12.041655 | 50.154339 | 35.0 |
1570 | Sierra Leone | SLE | 21.000000 | 2.578026 | 37.286970 | 36.0 |
514 | Ethiopia | ETH | 21.200001 | 29.933790 | 31.734772 | 37.0 |
1 | Afghanistan | AFG | 21.500000 | 15.856679 | 31.832814 | 46.0 |
1349 | Panama | PAN | 21.600000 | 29.440288 | 72.731840 | 16.0 |
1552 | Senegal | SEN | 21.600000 | 16.121315 | 36.028431 | 24.0 |
928 | Kiribati | KIR | 21.600000 | 0.156120 | 91.443108 | 53.0 |
1675 | Sudan | SDN | 22.000000 | 58.962978 | 38.035416 | 17.0 |
229 | Cambodia | KHM | 22.000000 | 11.242275 | 41.599239 | 19.0 |
223 | Burundi | BDI | 22.100000 | 2.032135 | 20.830005 | 40.0 |
1679 | Suriname | SUR | 22.799999 | 4.368398 | 80.066982 | 28.0 |
1365 | Peru | PER | 23.100000 | 147.528937 | 92.318985 | 38.0 |
911 | Kazakhstan | KAZ | 23.100000 | 148.047348 | 100.571630 | 16.0 |
1343 | Palestine | PSE | 23.200001 | 9.681500 | 89.516943 | 29.0 |
1175 | Mongolia | MNG | 23.400000 | 7.189482 | 97.961914 | 27.0 |
1793 | Turkey | TUR | 23.799999 | 776.992600 | 81.419664 | 32.0 |
175 | Brazil | BRA | 23.900000 | 2208.838109 | 100.510277 | 23.0 |
327 | Costa Rica | CRI | 24.200001 | 37.658615 | 102.470738 | 27.0 |
1470 | Russia | RUS | 24.400000 | 1524.917468 | 87.992987 | NaN |
25 | Armenia | ARM | 24.500000 | 9.260285 | 97.644978 | 10.0 |
22 | Argentina | ARG | 24.600000 | 423.627422 | 103.883884 | 27.0 |
835 | Israel | ISR | 24.799999 | 234.654590 | 104.193341 | 6.0 |
1757 | Thailand | THA | 24.900000 | 341.104820 | 85.234282 | NaN |
1462 | Romania | ROU | 25.600000 | 166.309355 | 95.950938 | 18.0 |
1331 | Oman | OMN | 25.600000 | 64.993498 | 96.810646 | NaN |
1105 | Malaysia | MYS | 25.700001 | 255.016609 | 80.759233 | 19.0 |
1559 | Serbia | SRB | 26.000000 | 41.819469 | 93.420276 | 17.0 |
1397 | Poland | POL | 26.100000 | 479.834179 | 95.984723 | 13.0 |
508 | Eswatini | SWZ | 26.200001 | 4.438778 | 64.199763 | 18.0 |
1152 | Mexico | MEX | 26.200001 | 1057.801296 | 90.859625 | 24.0 |
print("{} countries have missing IPV value".format(len(top_40_countries) - len(top_40_countries.dropna())))
print("{} countries have missing IPV value".format(len(bottom_40_countries) - len(bottom_40_countries.dropna())))
# cold-deck imputation: fill in values found in an additional source
top_40_countries.loc[top_40_countries['Code'] == 'BRB', 'IPV'] = 30
bottom_40_countries.loc[bottom_40_countries['Code'] == 'RUS', 'IPV'] = 21
bottom_40_countries.loc[bottom_40_countries['Code'] == 'THA', 'IPV'] = 41
# drop countries with missing IPV data
top_40_countries = top_40_countries.dropna()
bottom_40_countries = bottom_40_countries.dropna()
7 countries have missing IPV value
3 countries have missing IPV value
As shown, there are 10 countries without IPV data.
Among the missing-value techniques, we would like to use cold-deck imputation, taking values from an additional source: data for Russia, Barbados, and Thailand were found there and inserted. We should not use mean or hot-deck imputation because each country's IPV rate can differ drastically owing to many other factors. Thus, for values that we cannot find in the additional source, we will omit the corresponding countries from the analysis. As discussed above, this was another primary reason to increase the sample size.
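The cold-deck step can be sketched generically: gaps are filled from an external lookup, and anything still missing is dropped. The frame `primary`, the dictionary `external_source`, and all values below are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical primary data with gaps, and a hypothetical external source.
primary = pd.DataFrame({
    "Code": ["AAA", "BBB", "CCC", "DDD"],
    "IPV": [20.0, np.nan, np.nan, 35.0],
})
external_source = {"BBB": 28.0}  # values looked up elsewhere, keyed by code

# Cold-deck step: fill each gap from the external source when it has a value.
primary["IPV"] = primary["IPV"].fillna(primary["Code"].map(external_source))

# Countries still missing after the lookup are dropped from the analysis.
complete = primary.dropna(subset=["IPV"])

print(complete)
```

Existing values are never overwritten here, because `fillna` only touches entries that are missing.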
# Scatter plot for top 40 countries
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in top_40_countries.iterrows():
    ax.scatter(row['GDP'], row['MarriageAge'], s=100)
    ax.annotate(row.Entity, xy=(row['GDP'], row['MarriageAge']), xytext=(row['GDP'] - 100, row['MarriageAge'] + 0.1))
plt.title("Relationship between GDP and Marriage Age - Top 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()
# Scatter plot for bottom 40 countries
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in bottom_40_countries.iterrows():
    ax.scatter(row['GDP'], row['MarriageAge'], s=100)
    ax.annotate(row.Entity, xy=(row['GDP'], row['MarriageAge']), xytext=(row['GDP'] - 100, row['MarriageAge'] + 0.1))
plt.title("Relationship between GDP and Marriage Age - Bottom 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()
Observations about relationship between GDP and marriage age
# Scatter plot for top 40 countries
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in top_40_countries.iterrows():
    if not math.isnan(row['Education']):
        ax.scatter(row['Education'], row['MarriageAge'], s=100)
        ax.annotate(row.Entity, xy=(row['Education'], row['MarriageAge']), xytext=(row['Education'] - 3.5, row['MarriageAge'] + 0.1))
plt.title("Relationship between Female Education and Marriage Age - Top 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()
# Scatter plot for bottom 40 countries
plt.figure(figsize = (15, 15))
ax = plt.subplot(111)
for index, row in bottom_40_countries.iterrows():
    if not math.isnan(row['Education']):
        ax.scatter(row['Education'], row['MarriageAge'], s=100)
        ax.annotate(row.Entity, xy=(row['Education'], row['MarriageAge']), xytext=(row['Education'] - 3.5, row['MarriageAge'] + 0.1))
plt.title("Relationship between Female Education and Marriage Age - Bottom 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()
Observations about relationship between female education and marriage age
fig, ax = plt.subplots(1, 1, figsize=(20,20))
plt.title('IPV for Each Country')
sns.barplot(y=ipv['Name'], x=ipv['IPV'], ax=ax)
# adding a line to represent the mean IPV percentage
ax.axvline(ipv['IPV'].mean(), color="blue", linewidth=2)
plt.xlabel("IPV rate")
plt.ylabel("Country")
plt.show()
Observations about relationship between IPV and marriage age
from scipy.stats import pearsonr
# GDP
pearsons_r, p_value = pearsonr(bottom_40_countries['GDP'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs GDP: {}, {}".format(pearsons_r, p_value))
# Education
pearsons_r, p_value = pearsonr(bottom_40_countries['Education'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs Education: {}, {}".format(pearsons_r, p_value))
# IPV
pearsons_r, p_value = pearsonr(bottom_40_countries['IPV'], bottom_40_countries['MarriageAge'])
print("Pearson's r and p value for bottom 40 countries' Marriage Age vs IPV: {}, {}".format(pearsons_r, p_value))
Pearson's r and p value for bottom 40 countries' Marriage Age vs GDP: 0.36468455396131355, 0.033958557499972765
Pearson's r and p value for bottom 40 countries' Marriage Age vs Education: 0.8049608227420808, 9.526527603805122e-09
Pearson's r and p value for bottom 40 countries' Marriage Age vs IPV: -0.41926374066357386, 0.01358020237875942
Based on the p-values above (approximately 0.034, 9.5e-09, and 0.014 for GDP, Education, and IPV, respectively), we can reject the null hypothesis for each factor because all of them fall below our 5% cutoff. In other words, for the bottom 40 countries, all three factors are significantly correlated with women's average marriage age: Education strongly (r ≈ 0.80), GDP and IPV moderately (r ≈ 0.36 and r ≈ -0.42). Based on these findings, we draw linear regression fits as follows to visually observe the trend.
plt.figure(figsize = (5, 5))
sns.regplot(x = 'GDP', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between GDP and Marriage Age - Bottom 40")
plt.xlabel("GDP (billion $)")
plt.ylabel("Marriage Age")
plt.show()
plt.figure(figsize = (5, 5))
sns.regplot(x = 'Education', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between Female Education and Marriage Age - Bottom 40")
plt.xlabel("Education")
plt.ylabel("Marriage Age")
plt.show()
plt.figure(figsize = (5, 5))
sns.regplot(x = 'IPV', y = 'MarriageAge', data=bottom_40_countries)
plt.title("Relationship between Female IPV and Marriage Age - Bottom 40")
plt.xlabel("IPV")
plt.ylabel("Marriage Age")
plt.show()
By observing the graphs above, the linear relationship between each of the three factors and our dependent variable, average marriage age, becomes easier to see; it is strongest for education.
Now, we would like to observe whether our regression models for the bottom countries fit the data for the top countries. To do so, we render distribution plots of the actual and predicted average marriage age for each contributing factor and check whether they share a similar mean and shape.
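To put a number on how much of the variation each line explains, one can square Pearson's r. The sketch below does this for the three r values printed above (rounded to four decimals) and then shows how `scipy.stats.linregress` returns the slope, intercept, and r in a single call. The `education` and `marriage_age` arrays are synthetic stand-ins generated for illustration, not the tutorial's data.

```python
import numpy as np
from scipy.stats import linregress

# r values reported above for the bottom 40 countries, rounded to 4 decimals.
reported_r = {"GDP": 0.3647, "Education": 0.8050, "IPV": -0.4193}

# r squared is the share of variance in marriage age explained by a straight line.
r_squared_by_factor = {name: round(r ** 2, 2) for name, r in reported_r.items()}
print(r_squared_by_factor)

# Synthetic example data (assumption: NOT the tutorial's dataset).
rng = np.random.default_rng(0)
education = rng.uniform(10, 90, size=40)
marriage_age = 18 + 0.05 * education + rng.normal(0, 0.8, size=40)

# linregress fits y = slope * x + intercept and reports r and the p-value.
fit = linregress(education, marriage_age)
r_squared = fit.rvalue ** 2
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.2f}, "
      f"r^2={r_squared:.2f}, p={fit.pvalue:.2g}")
```

With the reported values, a straight line explains roughly 65% of the variance for education but only about 13% for GDP and 18% for IPV, which is why the education trend is the clearest of the three.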
# GDP
lm = LinearRegression()
lm.fit(bottom_40_countries[['GDP']], bottom_40_countries['MarriageAge'])
predicted = lm.predict(top_40_countries[['GDP']])
f, ax = plt.subplots(figsize=(10,10))
plt.title('Actual and Predicted Marriage Age Distribution for GDP model')
sns.kdeplot(top_40_countries['MarriageAge'], label="Actual", ax=ax)
sns.kdeplot(predicted, label="Predictions", ax=ax)
plt.legend()
plt.show()
# Education
lm = LinearRegression()
lm.fit(bottom_40_countries[['Education']], bottom_40_countries['MarriageAge'])
predicted = lm.predict(top_40_countries[['Education']])
f, ax = plt.subplots(figsize=(10,10))
plt.title('Actual and Predicted Marriage Age Distribution for Education model')
sns.kdeplot(top_40_countries['MarriageAge'], label="Actual", ax=ax)
sns.kdeplot(predicted, label="Predictions", ax=ax)
plt.legend()
plt.show()
# IPV
lm = LinearRegression()
lm.fit(bottom_40_countries[['IPV']], bottom_40_countries['MarriageAge'])
predicted = lm.predict(top_40_countries[['IPV']])
f, ax = plt.subplots(figsize=(10,10))
plt.title('Actual and Predicted Marriage Age Distribution for IPV model')
sns.kdeplot(top_40_countries['MarriageAge'], label="Actual", ax=ax)
sns.kdeplot(predicted, label="Predictions", ax=ax)
plt.legend()
plt.show()
As the graphs above indicate, the actual and predicted marriage age distributions differ drastically for all three factors. Of the three, the GDP model has the most similar spread, but the mean values are quite different: 23 vs. 30. For education and IPV, both the mean and the spread are off. Predicted values have a narrow spread around means of 24 and 25, whereas the actual distribution has a wider spread around a mean of 30. As a result, we conclude that the linear models fitted on the bottom countries do not fit the top countries well.
Since the linear regression models of the bottom countries do not fit the data of the top countries, we would like to check whether a linear relationship exists at all in the top countries' data. We evaluate our null hypothesis based on p-values once again.
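The visual comparison of means and spreads can also be backed by a few numbers. The sketch below fits a model on one group, predicts for the other, and reports the mean, standard deviation, and root mean squared error (RMSE). The `bottom` and `top` frames are synthetic stand-ins with assumed values, so the printed figures will not match the plots above; only the column names `GDP` and `MarriageAge` follow this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the two groups (assumption: NOT the real data).
rng = np.random.default_rng(1)
bottom = pd.DataFrame({"GDP": rng.uniform(1, 100, 40)})
bottom["MarriageAge"] = 20 + 0.03 * bottom["GDP"] + rng.normal(0, 1, 40)
top = pd.DataFrame({"GDP": rng.uniform(100, 2000, 40)})
top["MarriageAge"] = 30 + rng.normal(0, 1.5, 40)

# Fit on the bottom group, then predict marriage age for the top group.
lm = LinearRegression().fit(bottom[["GDP"]], bottom["MarriageAge"])
predicted = lm.predict(top[["GDP"]])
actual = top["MarriageAge"].to_numpy()

# Compare centre and spread of the two distributions.
summary = pd.DataFrame(
    {"mean": [actual.mean(), predicted.mean()],
     "std": [actual.std(ddof=1), predicted.std(ddof=1)]},
    index=["actual", "predicted"],
)

# RMSE: typical size of the prediction error, in years of marriage age.
rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
print(summary)
print(f"RMSE: {rmse:.2f}")
```

A large gap between the two means, or an RMSE of several years, supports the same conclusion as the distribution plots without relying on reading the curves by eye.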
# GDP
pearsons_r, p_value = pearsonr(top_40_countries['GDP'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs GDP: {}, {}".format(pearsons_r, p_value))
# Education
pearsons_r, p_value = pearsonr(top_40_countries['Education'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs Education: {}, {}".format(pearsons_r, p_value))
# IPV
pearsons_r, p_value = pearsonr(top_40_countries['IPV'], top_40_countries['MarriageAge'])
print("Pearson's r and p value for top 40 countries' Marriage Age vs IPV: {}, {}".format(pearsons_r, p_value))
Pearson's r and p value for top 40 countries' Marriage Age vs GDP: 0.23994458082697234, 0.21875549286589246
Pearson's r and p value for top 40 countries' Marriage Age vs Education: 0.19423195926348696, 0.3219815182448406
Pearson's r and p value for top 40 countries' Marriage Age vs IPV: 0.28131984532033316, 0.1469993294672017
Based on the p-values above (approximately 0.219, 0.322, and 0.147 for GDP, Education, and IPV, respectively), we cannot reject the null hypothesis because all of them are above our 5% cutoff. That is, we find no statistically significant linear relationship between any of the three factors and women's average marriage age for the top 40 countries.
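Because the same test is repeated for each factor and each group, it may help to wrap it in a small helper. The function below is a hypothetical convenience, not part of the original notebook: `correlation_report` and the `demo` frame are made-up names and values. It drops rows where either column is missing before calling `pearsonr`, then compares the p-value with the 5% cutoff used in this tutorial.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

ALPHA = 0.05  # the 5% cutoff used throughout this tutorial


def correlation_report(df, factor, target="MarriageAge", alpha=ALPHA):
    """Pearson's r for factor vs target, ignoring rows with missing values."""
    pair = df[[factor, target]].dropna()
    r, p = pearsonr(pair[factor], pair[target])
    return {
        "factor": factor,
        "n": len(pair),
        "r": float(r),
        "p": float(p),
        "reject_null": bool(p < alpha),
    }


# Tiny invented example with one missing Education value.
demo = pd.DataFrame({
    "Education": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
    "MarriageAge": [19.0, 20.5, 21.0, 22.0, 23.5, 24.0],
})
report = correlation_report(demo, "Education")
print(report)
```

Note that `reject_null` being false only means the data do not show a significant linear relationship; it is not proof that no relationship exists.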
In conclusion, GDP, education level, and IPV are all significantly correlated with marriage age for the bottom 40 countries. Discussing solutions for each of these three factors may therefore help decrease child marriage rates.
If women receive a high level of education and go on to pursue careers, they are more likely to achieve personal goals and possibly postpone marriage. In addition, if a country's economic level exceeds a certain point, families will have less need for a marriage dowry. So that women can lead their lives independently, governments should raise awareness of child marriage and put their best efforts into improving the situation.
As shown above, female education level is related to marriage age. One way to prevent child marriage is for governments to mandate a required period of education for both boys and girls. If this becomes the law and part of the culture of the country, children will receive all the required education. As a result, girls will stay in school longer, and the proportion of marriages under the age of 18 is expected to decrease.
In countries with a low marriage age, the rate of Intimate Partner Violence tends to be higher. It would not be accurate to generalize that the younger one marries, the more domestic violence one experiences, but the lower the age, the more likely the marriage is unwanted, because girls are often forced into it by their parents. Although many countries already prohibit early marriage by law, the practice is still widespread. Likewise, domestic violence may be prohibited by law, but the high IPV figures suggest that young married girls often remain outside the law's protection. One way to improve this is to change perceptions. If more people learn that child marriage is a serious global problem, it may eventually be taken seriously even in countries where child marriage is prevalent.
Note that GDP, education, and IPV do not show a significant relationship with average marriage age for the top 40 countries. Similar to the Happiness Index, this suggests that marriage age is less affected by such factors in countries with higher overall wealth. That is, once a country has sufficient wealth and education and a low rate of violence, marriage age is possibly driven by other factors. It would be exciting to find out what those factors are.
Another interesting aspect is that global organizations have recently reported that child marriage has increased in countries that were hit relatively hard by COVID-19. Since not enough public data is available as of now, it is not easy to analyze the relationship between average marriage age and COVID-19 mortality or Intensive Care Unit (ICU) rates. However, when such data becomes available, it would be fascinating to do similar research on this topic.