Do both government institutions and societal factors have an effect on the number of coronavirus cases per country?

Introduction

COVID-19, also known colloquially as the coronavirus, is an ongoing pandemic that has reached almost every part of the world, and is affecting all of our lives in some way. Different countries have been handling the pandemic in different ways, ranging from complete lockdowns (China), to letting the virus run its course (Sweden). In addition to handling the virus in varied ways, countries also have a wide range of healthcare systems. We were curious about what effect different factors from the government and society have on coronavirus cases.

In an attempt to measure these effects, we looked at the happiness scores, healthcare rankings, and coronavirus cases by country. Our data shows that both government and societal factors of a country have an effect on the number of coronavirus cases. Specifically, we found that there is a significant correlation between happiness scores and coronavirus cases, as well as between healthcare rankings and coronavirus cases. These findings are important because they show that more than one aspect of a country plays a role in the coronavirus pandemic. In a broader relation to global health efforts, it is crucial to examine various components and to consider that external factors other than government intervention may contribute to a specific event.

To look at the government aspects, we looked at the relationship between the healthcare of a country and the number of cases of coronavirus in that country. The healthcare rankings were based on a study by the World Health Organization. The factors include: care process (preventative care measures, safe care, coordinated care, and engagement and patient preferences), access (affordability and timeliness), administrative efficiency, equity, and healthcare outcomes (population health, mortality amenable to healthcare, and disease-specific health outcomes).

To look at the societal factors, we looked at the relationship between the happiness scores of a country and the number of cases of coronavirus in that country. The happiness scores and rankings data came from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative.

Overall, we were trying to determine whether the healthcare system or the happiness level of a country truly impacts what happens during a pandemic, or whether an external variable, such as how society handles the pandemic, is a better indicator. To see what effect happiness scores and healthcare rankings had on coronavirus cases, we fit a linear regression for each pair of variables, as well as a multiple linear regression. Where appropriate, we performed t-tests to check for significance (i.e., to ensure the relationships established are not due to random chance) and examined residual plots to assess the linearity of the data set(s).

1. Original Data

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
import seaborn as sns

1a. Healthcare Rankings

In [31]:
healthcare = pd.read_csv("healthcare.csv")
healthcare.head()
Out[31]:
Country healthcareRank pop2020
0 France 1 65273.511
1 Italy 2 60461.826
2 San Marino 3 33.931
3 Andorra 4 77.265
4 Malta 5 441.543
  • Observations and attributes: The dataset has observations of countries (rows), and attributes (columns) for healthcare ranking and 2020 population size. The rankings are ordered so that the country with the highest-ranked healthcare has a ranking of 1.
  • Who funded/why created: The dataset was created by the World Population Review, which is an independent organization with no political affiliation, and has the goal of making demographic data more accessible through charts and visualizations. The “Best Healthcare In The World 2020” dataset was created using census data to obtain population sizes. Healthcare rankings were determined using the Legatum Institute's Prosperity Index.
  • What process was used: The Legatum Institute's Prosperity Index dataset only ranked 149 of the 195 countries. There is no information on the websites detailing why 46 countries were missing from the dataset. From observation it seems as if smaller and less developed countries were the ones that were omitted. This makes sense, because there may be less data about these areas, resulting in an inability to accurately predict qualities such as healthcare quality. Therefore, since the World Population Review took rankings from the Legatum dataset, these countries were not included in the dataset we used.
  • Preprocessing: The preprocessing for this dataset was very straightforward, as the data was easily downloaded to a csv. From there, we created a dataframe which included the country and ranking columns from the downloaded dataset.
  • Purpose of data: The data comes from the World Population Review which has the goal of making demographic data more accessible as in many cases this type of data is hidden behind spreadsheets or is hard to interpret by the public. The page where this dataset comes from also includes a map of the world with color coded countries by their healthcare rank. Therefore, this data had the purpose of educating the general public about healthcare around the world.

1b. Happiness Scores

In [32]:
happiness = pd.read_csv("2019happiness.csv")
happiness.head()
Out[32]:
Overall rank Country Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
  • Observations & Attributes: The observations for the data set are countries, while the attributes include happiness score, GDP per capita, social support score, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption.
  • Who funded/why created: Using data from Gallup World Poll and released by the United Nations on International Day of Happiness, this dataset was created to help quantify the state of global happiness.
  • What process was used: The questions were asked in a survey by Gallup World Poll; the fact that a survey was used suggests that the data that was recorded were only from individuals that responded to the survey. Depending on who answered the survey (i.e. if they were individuals from different social demographics), the answers may vary based on the life evaluations made by these individuals. If we are assuming that this data is unbiased and that the individuals who answered these questions vary and have different social, economic and political backgrounds, it may suggest that this data is comprehensive. However, there may be the existence of response bias, as only those who answered the survey will contribute to the survey.
  • Preprocessing: This dataset was downloaded from Kaggle and came as a csv file; much of the preprocessing involved aligning the countries from this data set with other data sets by filtering out excess countries and extracting specific columns from this data set.
  • Purpose of data: Since this data set was published on Kaggle, it is evident that the purpose of this data was to be used to establish some standard for happiness around the world. As this dataset was derived from a survey, it is evident that people were aware of the data collection as they were specifically asked questions regarding happiness (on a scale of 0-10). Since this survey was conducted by Gallup polls, people most likely expected the data to be used as a metric for life evaluation, although they may not have associated the life evaluation questions with happiness and other attributes.

1c. Coronavirus Cases

In [33]:
covid_cases = pd.read_csv("COVID-19Cases.csv")
covid_cases.head()
Out[33]:
Country,\nOther Total\nCases New\nCases Total\nDeaths New\nDeaths Total\nRecovered Active\nCases Serious,\nCritical Tot Cases/\n1M pop Deaths/\n1M pop Total\nTests Tests/\n1M pop
0 Afghanistan 1,463 112 47 4.0 188 1,228 7 38 1 7,425 191
1 Albania 712 34 27 NaN 403 282 4 247 9 7,015 2,438
2 Algeria 3,256 129 419 4.0 1,479 1,358 40 74 10 6,500 148
3 Andorra 731 NaN 40 NaN 344 347 17 9,461 518 1,673 21,653
4 Angola 25 NaN 2 NaN 6 17 NaN 0.8 0.06 NaN NaN
  • Observations and attributes: The dataset has observations of countries (rows). The columns are total cases, new cases, total deaths, new deaths, total recovered, active cases, serious/critical cases, total cases per 1 million population, deaths per 1 million population, total tests, and tests per 1 million population.
  • Who funded/why created: This data was created with the goal of making statistics available worldwide, and to inform people about where the coronavirus is. This data was funded by Worldometer, which is an international team of developers, researchers, and volunteers. They have no political, governmental, or corporate affiliation.
  • What process was used: Worldometer says they collect their data from official reports from governments around the world. They provide the source of each data update in the “Latest Updates” (News) section. Since most of this work is done by analysts and researchers who validate the data, it is not immune to human error. They claim that since national aggregates often lag behind local and regional health department data, their team monitors daily reports from local authorities.
  • Preprocessing: The numeric values were stored as strings, so we removed the thousands-separator commas and the spaces before and after each value, then converted the values from strings to ints or floats.
  • Purpose of data: There are people involved since the counts are mainly how many people have COVID-19, but the data is anonymous as there are not individual names attached to anything. It is not clear that the people were asked explicitly for their data to be included, however it is somewhat assumed that during this pandemic all cases will be reported to authorities to track the spread of the virus.
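The cleaning steps described in the preprocessing bullet can be sketched with pandas string methods. This is a minimal illustration on a few made-up values, not the notebook's actual preprocessing code; the helper name `to_float_column` is hypothetical.

```python
import pandas as pd

def to_float_column(series):
    """Strip whitespace and thousands separators, then convert to float.

    Hypothetical helper for illustration; unparseable entries become NaN.
    """
    cleaned = series.astype(str).str.strip().str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# made-up sample values in the same style as the raw table
raw = pd.Series([" 1,463 ", "712", "NaN", "9,461"])
clean = to_float_column(raw)
print(clean.tolist())
```

Using `errors="coerce"` keeps missing entries as NaN so they can be dropped later with `dropna`.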
In [34]:
# calculates the standard error of the regression slope from the residuals
def standard_error(x,y,pred):
    x_mean = x.mean()
    n = len(x)
    sum_x = np.sum((x-x_mean)**2)
    sum_y = np.sum((y-pred)**2) 
    se = (sum_y/((n-2)*sum_x))**(1/2)
    return se
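
As a sanity check, the value returned by `standard_error` (the standard error of the regression slope) can be compared with the `stderr` field reported by `scipy.stats.linregress`. The sketch below uses made-up data and repeats the function so that it runs on its own.

```python
import numpy as np
from scipy import stats

def standard_error(x, y, pred):
    # same formula as the notebook: sqrt(SSE / ((n - 2) * Sxx))
    n = len(x)
    sum_x = np.sum((x - x.mean()) ** 2)
    sum_y = np.sum((y - pred) ** 2)
    return (sum_y / ((n - 2) * sum_x)) ** 0.5

# made-up data for illustration only
rng = np.random.default_rng(0)
x = np.arange(1.0, 31.0)
y = 2.0 - 0.5 * x + rng.normal(scale=1.0, size=x.size)

fit = stats.linregress(x, y)
pred = fit.slope * x + fit.intercept
se_manual = standard_error(x, y, pred)
print(se_manual, fit.stderr)
```

The two printed values should agree up to floating-point error.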

2. Relationships Between Each Pair of Datasets

In section 2, we look at the relationship between each pair of datasets. This will give us a better idea about how these variables relate in a larger sample size before we combine all three factors into one dataframe for further analysis.
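
One possible way to pair two datasets by country is an inner join with `pandas.merge`, which keeps only the countries present in both tables. The frames and column names below are small made-up stand-ins; this is a sketch of an alternative to loop-based matching, not the method the notebook uses.

```python
import pandas as pd

# small stand-in frames; the column names here are illustrative assumptions
healthcare_demo = pd.DataFrame(
    {"Country": ["France", "Italy", "Malta"], "healthcareRank": [1, 2, 5]}
)
covid_demo = pd.DataFrame(
    {"Country": ["France", "Malta", "Angola"], "CasesPerMillion": [2449.0, 1015.0, 0.8]}
)

# inner join: countries missing from either table are dropped
paired = pd.merge(healthcare_demo, covid_demo, on="Country", how="inner")
print(paired)
```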

2a. Corona Cases and Healthcare Rankings

In [35]:
# making data frame of healthcare ranking and cases per million
covid_health = pd.DataFrame(healthcare["Country"])
covid_health["Healthcare ranking"] = healthcare["healthcareRank"]
covidcases = []
for x in covid_health["Country"]:
    cases = covid_cases[covid_cases["Country,\nOther"]==x]
    covidcases.append(cases["Tot Cases/\n1M pop"].to_string(index=False))
covid_health["Cases Per Million"] = covidcases
# countries missing from the COVID-19 table show up as the string of an empty Series;
# .copy() avoids SettingWithCopyWarning on the assignments below
new_covid_health = covid_health[covid_health["Cases Per Million"]!="Series([], )"].copy()

## making all cases into floats (remove commas and surrounding spaces first)
new_covid_health["Cases Per Million"] = (
    new_covid_health["Cases Per Million"]
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype("float")
)

## making all rankings into floats
new_covid_health["Healthcare ranking"] = new_covid_health["Healthcare ranking"].astype("float")

## taking out NaN data
new_covid_health = new_covid_health.dropna(subset = ["Cases Per Million"])
new_covid_health.head()
Out[35]:
Country Healthcare ranking Cases Per Million
0 France 1.0 2449
1 Italy 2.0 3231
2 San Marino 3.0 15119
3 Andorra 4.0 9461
4 Malta 5.0 1015
In [36]:
## making scatterplot 
plt.figure(figsize = (5,5))
plt.scatter(new_covid_health['Healthcare ranking'].astype("float"),new_covid_health['Cases Per Million'].astype("float"))
plt.xlim(0,120)
plt.ylim(0,7000)
plt.ylabel('cases per million')
plt.xlabel('healthcare ranking')
plt.title("Healthcare ranking and coronavirus cases per million ")
plt.yticks()
plt.show()
Out[36]:

Based on the scatter plot, it appears that there is a negative correlation between healthcare ranking and coronavirus cases, meaning that as the ranking number gets higher (that is, as healthcare quality gets worse), the country sees fewer cases.

In [37]:
xarray = np.array(new_covid_health["Healthcare ranking"].astype("float")) #healthcare ranking
yarray = np.array(new_covid_health["Cases Per Million"].astype("float")) #cases 
plt.figure(figsize = (5,5))
sns.residplot(x=xarray, y=yarray, lowess=False, color="b")
plt.title("Residual plot for corona cases and healthcare ranking")
plt.show()
Out[37]:

There appears to be some pattern to the residual plot, which suggests that a linear regression is not likely to be appropriate.

In [25]:
#correlation
correlation = np.corrcoef(xarray, yarray)[0,1]
print("Correlation coefficient: ", correlation)
print("Coefficient of determination: ",correlation**2)
Correlation coefficient:  -0.5048220043533121
Coefficient of determination:  0.25484525607929553

A correlation coefficient of about -0.50 suggests a moderate negative relationship between the two variables. The coefficient of determination is about 0.25, meaning 25% of the variability in cases can be explained by the model. However, since the residuals are not random, we decided to take the log of the coronavirus cases to make a linear model. This is appropriate because cases are increasing exponentially.
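
To illustrate why a log transform can help, the sketch below generates made-up data in which the response decays exponentially with the predictor, then compares the correlation before and after taking the natural log. The numbers are synthetic and only demonstrate the idea; they are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 100, 60)
# synthetic response: exponential decay in x with multiplicative noise
y = 2000 * np.exp(-0.04 * x) * np.exp(rng.normal(scale=0.3, size=x.size))

r_raw = np.corrcoef(x, y)[0, 1]          # correlation on the original scale
r_log = np.corrcoef(x, np.log(y))[0, 1]  # correlation after the log transform
print("r (raw):", r_raw)
print("r (log):", r_log)
```

When the underlying relationship is exponential, the correlation on the log scale is closer to -1 or +1 because the transformed points fall nearer to a straight line.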

Healthcare Ranking and Log of Coronavirus Cases

In [26]:
#scatterplot

yarraylog = np.log(yarray) #taking the log of cases 

plt.figure(figsize=(5,5))
plt.scatter(xarray, yarraylog)
plt.xlim(0,120)
plt.ylim(0,30)
plt.xlabel("Healthcare Ranking")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Healthcare Ranking compared to the Log of Coronavirus Cases")
plt.show()
Out[26]:

Based on the scatterplot, it appears that there is a relatively strong negative relationship between healthcare ranking and the log of coronavirus cases, and that a linear relationship describes this data better than it describes the original data.

In [27]:
plt.figure(figsize = (5,5))
sns.residplot(x=xarray, y=yarraylog, lowess=False, color="b")
plt.title("Residual plot for log of corona cases and healthcare ranking")
plt.show()
Out[27]:

This residual plot is much more random than the one for the original data, suggesting that a linear model is more appropriate.

In [36]:
#linear regression
newmodel = LinearRegression().fit(new_covid_health[['Healthcare ranking']].astype("float"),np.log(new_covid_health[['Cases Per Million']].astype("float")))
print("Linear Regression for Healthcare ranking and Log of Coronavirus Cases: ")
print("Regression Slope: ", newmodel.coef_[0])
print("Regression Intercept: ", newmodel.intercept_)
print("Coefficient of determination: ", newmodel.score(new_covid_health[['Healthcare ranking']].astype("float"),np.log(new_covid_health[['Cases Per Million']].astype("float"))))
health = np.array(new_covid_health['Healthcare ranking'].astype("float"))
correlation = np.corrcoef(health, yarraylog)[0,1]
print("Correlation Coefficient: ", correlation)
Linear Regression for Healthcare ranking and Log of Coronavirus Cases: 
Regression Slope:  [-0.03859679]
Regression Intercept:  [7.6741454]
Coefficient of determination:  0.3860641720895487
Correlation Coefficient:  -0.6213406248504508
$$\hat{y} = -0.0386x + 7.6741$$
  • where y is the log of coronavirus cases and x is the healthcare ranking of a country
  • Based on the linear regression model established between healthcare ranking and the log of coronavirus cases:
    • The regression slope of -0.0386 indicates that for every one-place increase in healthcare ranking, the log of coronavirus cases (per 1M population) decreases by about 0.0386. Because the natural log was used, this corresponds to roughly 3.8% fewer cases per million for each one-place increase in ranking.
    • The regression intercept of 7.67 is the predicted log of cases at a healthcare ranking of zero. It is hard to interpret because healthcare rankings are discrete rather than continuous and there is no healthcare ranking of zero, which is a limitation of our model.
    • The coefficient of determination showcases the amount of variation in the dependent variable that can be predicted by the independent variable. A value of 0.386 means that 38.6% of the variation in the log of coronavirus cases can be predicted by the linear regression modeled with healthcare ranking.
    • The correlation coefficient of -0.62 suggests a moderately strong negative relationship.
    • We then did a hypothesis test to see how significant our findings were.
    • H0: There is no relationship between healthcare ranking and log of coronavirus cases
    • H1: There is a relationship between healthcare ranking and log of coronavirus cases
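
Because the response is the natural log of cases, a slope is easier to read after exponentiating it. The short sketch below converts the slope printed by the regression cell into a percent change in cases per million for each one-place step in ranking; the helper name `percent_change_per_unit` is hypothetical.

```python
import math

def percent_change_per_unit(slope):
    """Percent change in the original-scale response per unit of x
    when the model is fitted to log(y). Hypothetical helper."""
    return (math.exp(slope) - 1.0) * 100.0

slope = -0.03859679  # regression slope printed by the regression cell
print(round(percent_change_per_unit(slope), 2))
```

A negative result means fewer predicted cases per million as the ranking number increases.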
In [37]:
#hypothesis test for log corona and healthcare ranking
slope = newmodel.coef_[0]
intercept = newmodel.intercept_
m = float(slope)
b = float(intercept)
df = len(xarray)-2
## finding standard error of the slope
preds = m*xarray +b
se = standard_error(xarray,yarraylog,preds)

t = m/se # test statistic 
p = stats.t.cdf(t,df=df) # lower-tail (one-sided) p value, since t is negative
print("t statistic: ", t)
print("standard error: ", se)
print("degrees of freedom: ", df)
print("p value for significance level .05: ", p)
t statistic:  -7.606102509564342
standard error:  0.005074450845059108
degrees of freedom:  92
p value for significance level .05:  1.1778699659369838e-11
  • Since the p value is less than the significance level, we reject the null hypothesis. There is evidence of a linear relationship between the log of coronavirus cases and healthcare ranking.
  • Next, we looked at happiness scores and coronavirus cases.
Summary

In the original data, which has 94 data points, the scatterplot and correlation coefficient (-.50) suggest a negative correlation between the two variables. This means that countries with higher healthcare rankings, meaning worse healthcare systems, have fewer coronavirus cases. The scatter plot of residuals for the original data appeared to have some sort of pattern and didn’t appear to be random, so we did not perform a hypothesis test on it. We decided to take the log of the coronavirus cases because the cases are increasing exponentially in some places. After taking the log of the coronavirus cases, the residuals appeared much more random. We performed a t-test on the log of coronavirus cases, and found a p value of 1.18 x 10^-11, which is less than our significance level of .05. This suggests that there is strong evidence that the relationship between coronavirus cases and healthcare ranking is not due to random chance.
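
The alternative hypothesis in this test is two-sided, while `stats.t.cdf(t, df)` gives a lower-tail probability. A two-sided p value can be sketched as below, using the t statistic and degrees of freedom printed by the hypothesis-test cell. It is double the one-sided value, so the conclusion does not change.

```python
from scipy import stats

t = -7.606102509564342  # t statistic printed by the hypothesis-test cell
df = 92                 # degrees of freedom printed by the same cell

p_one_sided = stats.t.cdf(t, df=df)          # lower tail only
p_two_sided = 2 * stats.t.sf(abs(t), df=df)  # both tails
print("one-sided p:", p_one_sided)
print("two-sided p:", p_two_sided)
```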

2b. Happiness Scores and Corona Cases

In [6]:
#creating the dataframe for happiness scores and coronavirus cases 

happiness_covid = pd.DataFrame(happiness['Country'])
happiness_covid["Happiness Score"] = happiness["Score"]
cases = []
for country in happiness_covid["Country"]:
    row = covid_cases[covid_cases["Country,\nOther"]==country]
    cases.append(row["Tot Cases/\n1M pop"].to_string(index=False))

happiness_covid["Cases"] = cases
happiness_covid_filtered = happiness_covid[happiness_covid["Cases"]!='Series([], )']
happiness_covid_filtered.head()
Out[6]:
Country Happiness Score Cases
0 Finland 7.769 808
1 Denmark 7.600 1,458
2 Norway 7.554 1,382
3 Iceland 7.494 5,246
4 Netherlands 7.488 2,170
In [8]:
#scatterplot for coronavirus and happiness score

happiness_score = happiness_covid_filtered['Happiness Score']

# strip spaces and thousands separators so the values can be parsed as floats
cases_filtered = []
for y in happiness_covid_filtered["Cases"]:
    cases_filtered.append(y.replace(' ', '').replace(',', ''))

happiness_covid_filtered.loc[:, "Cases"] = cases_filtered

cases = happiness_covid_filtered['Cases'].astype("float")
plt.figure(figsize=(5, 5))
plt.ylabel("Coronavirus Cases (Total Cases/1M population)")
plt.xlabel("Happiness Score")
plt.title("Coronavirus Cases and Happiness Scores, by Country")
plt.scatter(happiness_score,cases)
plt.show()
Out[8]:

Based on the scatterplot, it appears that there is a positive correlation between happiness score and coronavirus cases, which means that an increase in happiness score is accompanied by an increase in coronavirus cases.

In [9]:
#residual plot 
# use a new name so the covid_cases dataframe loaded earlier is not overwritten
cases_float =  happiness_covid_filtered["Cases"].astype("float")
plt.figure(figsize=(5, 5))
sns.residplot(x=happiness_score, y=cases_float, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Coronavirus Cases")
plt.show()
Out[9]:

There appears to be some pattern to the residual plot, which suggests that a linear regression is not likely to be appropriate.

In [10]:
#additional values

#correlation
happy = np.array(happiness_covid_filtered["Happiness Score"])
case = np.array(happiness_covid_filtered["Cases"].astype(float))
correlation = np.corrcoef(happy, case)[0,1]
print("Correlation coefficient: ", correlation)
Correlation coefficient:  0.5714228055500767

A correlation coefficient of 0.57 suggests that there is a relatively strong positive relationship between coronavirus cases and happiness scores.

Summary

In the original data, which has 143 data points, the scatterplot and the correlation coefficient (+0.57) indicate that a positive correlation exists between happiness scores and coronavirus cases. Analyzed in a wider context, it suggests that higher happiness scores within countries are associated with more coronavirus cases. One possible explanation is that countries with higher happiness scores have more access to the resources needed to test for and report instances of COVID-19. This may explain why happier countries have far more reported cases than less happy countries.

Happiness Scores and Log of Coronavirus Cases

In [11]:
#scatterplot
cases = happiness_covid_filtered["Cases"].astype("float")
cases = np.log(cases)

plt.figure(figsize=(5,5))
plt.scatter(happiness_covid_filtered["Happiness Score"], cases)
plt.xlabel("Happiness Score")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Happiness Score compared to the Log of Coronavirus Cases")
plt.show()
Out[11]:

Based on the scatterplot, it appears that there is a relatively strong positive relationship between happiness score and the log of coronavirus cases and that a linear relationship better describes this data than the original.

In [12]:
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=happiness_score, y=cases, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Log of Coronavirus Cases")
plt.show()
Out[12]:

The residual plot indicates little to no pattern, which suggests that a linear regression is appropriate.

In [13]:
#linear regression
logmodel = LinearRegression().fit(happiness_covid_filtered[["Happiness Score"]], cases)
print("Linear Regression for Happiness Score and Log of Coronavirus Cases: ")
print("Regression Slope: ", logmodel.coef_[0])
print("Regression Intercept: ", logmodel.intercept_)
print("Coefficient of determination: ", logmodel.score(happiness_covid_filtered[["Happiness Score"]], cases))
Linear Regression for Happiness Score and Log of Coronavirus Cases: 
Regression Slope:  1.635527152658713
Regression Intercept:  -4.372904374039632
Coefficient of determination:  0.5666347150674728
  • Based on the linear regression model established between happiness score and the log of coronavirus cases:
    • The regression slope of 1.64 indicates that for every one-point increase in happiness score, the log of coronavirus cases (per 1M population) increases by 1.64. Because the natural log was used, this corresponds to roughly 5.1 times as many cases per million.
    • The regression intercept of -4.37 is the predicted log of cases at a happiness score of 0. A negative log value does not mean a negative number of cases; it corresponds to about 0.013 cases per million. The happiness scores in the data are well above 0, so this value is an extrapolation and should not be read literally.
    • The coefficient of determination showcases the amount of variation in the dependent variable that can be predicted by the independent variable. A value of 0.57 means that 57% of the variation in the log of coronavirus cases can be predicted by the linear regression modeled with happiness score.

Using the model of y = mx+b, the regression line that predicts coronavirus cases (log) from happiness scores is: $$\hat{y} = 1.64x - 4.37$$

with y as log of coronavirus cases and x as happiness score.
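
Predictions from this model are on the log scale, so they need to be exponentiated to get cases per million. The sketch below uses the rounded coefficients from the regression line; the function name is hypothetical, and the natural log is assumed because the notebook uses `np.log`.

```python
import math

SLOPE = 1.64       # rounded regression slope
INTERCEPT = -4.37  # rounded regression intercept

def predicted_cases_per_million(happiness_score):
    """Back-transform the log-scale prediction to cases per million."""
    log_cases = SLOPE * happiness_score + INTERCEPT
    return math.exp(log_cases)

for score in (4.0, 5.0, 6.0, 7.0):
    print(score, round(predicted_cases_per_million(score), 1))
```

Each one-point step in happiness score multiplies the predicted value by the same factor, `exp(SLOPE)`, which is why the relationship looks exponential on the original scale.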

In [14]:
#hypothesis test
m = logmodel.coef_[0]
b = logmodel.intercept_
pred = m*happy + b
se = standard_error(happy, cases, pred)

#t-statistic
t = m/se
df = len(happy)-2

p = 1-stats.t.cdf(t,df=df)  # upper-tail (one-sided) p value; prints as 0.0 when it is below floating-point precision

#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
The t-statistic:  13.577946276765257
The p-value:  0.0

H0: There is no relationship between happiness scores and log of coronavirus cases

H1: There is a relationship between happiness scores and log of coronavirus cases

Since the p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between happiness scores and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.
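
The p-value prints as exactly 0.0 because `1 - stats.t.cdf(t, df)` loses precision when the CDF is extremely close to 1. The survival function `stats.t.sf` computes the upper tail directly and returns a small positive number instead. The sketch below uses the t statistic printed by the hypothesis-test cell; the degrees of freedom are an assumption (141, taken from the 143 data points mentioned in the summary).

```python
from scipy import stats

t = 13.577946276765257  # t statistic printed by the hypothesis-test cell
df = 141                # assumed: 143 countries minus 2

p_subtract = 1 - stats.t.cdf(t, df=df)  # suffers from floating-point cancellation
p_sf = stats.t.sf(t, df=df)             # upper-tail probability computed directly
print(p_subtract, p_sf)
```

Either way the p-value is far below 0.05, so the decision to reject the null hypothesis is the same.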

In [15]:
#correlation
happy = np.array(happiness_covid_filtered["Happiness Score"])
correlation = np.corrcoef(happy, cases)[0,1]
print("Correlation Coefficient: ", correlation)
Correlation Coefficient:  0.7527514298010157

A correlation coefficient of 0.75 suggests that there is a strong positive relationship between the log of coronavirus cases and happiness scores. Compared to the correlation for the original data, without taking the log of coronavirus cases, this correlation coefficient suggests a stronger positive linear relationship.

Summary

With the original data, we also explored the relationship between happiness scores and the log of coronavirus cases. From the scatterplot and correlation coefficient (.75), it seems that there is a relatively strong positive relationship between the two variables. One explanation is that, in the original data (without the log), coronavirus cases grew roughly exponentially with happiness score, so each one-point increase in happiness score was associated with a multiplicative increase in cases. Taking the log of coronavirus cases turns this into a more linear relationship. With the log of coronavirus cases, the coefficient of determination was .567, which means that nearly 57% of the variation in the log of coronavirus cases could be explained by the happiness score of the country. Additionally, the residual plot went from showcasing a pattern to showcasing no pattern, which suggests that a linear relationship can be used when comparing happiness scores with the log of coronavirus cases. From the hypothesis testing of the linear regression, the p-value, essentially 0, was less than the significance level of 0.05, which suggests that our findings of the relationship between happiness score and coronavirus cases are statistically significant and not likely the consequence of random chance.

2c. Healthcare Rankings and Happiness Scores

In [9]:
# combine healthcare and happiness dataframes
health_happiness = pd.DataFrame(healthcare['Country'])
health_happiness["Healthcare Ranking"] = healthcare["healthcareRank"]
scores = []
for country in health_happiness["Country"]:
    row = happiness[happiness["Country"]==country]
    scores.append(row["Score"].to_string(index=False))
health_happiness["Happiness Score"] = scores
health_happiness_filtered = health_happiness[health_happiness["Happiness Score"]!='Series([], )']
health_happiness_filtered.head()
Out[9]:
Country Healthcare Ranking Happiness Score
0 France 1 6.592
1 Italy 2 6.223
4 Malta 5 6.726
5 Singapore 6 6.262
6 Spain 7 6.354
In [7]:
# scatterplot for healthcare ranking and happiness score
healthcare_ranking = health_happiness_filtered['Healthcare Ranking']
happiness_scores = health_happiness_filtered['Happiness Score'].astype("float")
plt.figure(figsize=(5,5))
plt.ylabel("Happiness Score")
plt.xlabel("Healthcare Ranking")
plt.title("Happiness Score and Healthcare Ranking, by Country")
plt.scatter(healthcare_ranking,happiness_scores)
plt.show()
Out[7]:

Based on the scatterplot, it appears that there is a relatively weak negative relationship between healthcare rankings and happiness scores, meaning that as healthcare ranking increases, happiness scores decrease. This makes sense since larger healthcare rankings correspond to worse healthcare. It is likely that countries with worse healthcare also have lower qualities of life (happiness scores).

In [22]:
# residual plot
happy_array = np.array(health_happiness_filtered["Happiness Score"].astype("float"))
health_array = np.array(health_happiness_filtered["Healthcare Ranking"])
plt.figure(figsize=(5,5))
sns.residplot(x=health_array, y=happy_array, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare Ranking vs. Happiness Score")
plt.show()
Out[22]:

The residual plot shows no pattern, demonstrating that a linear regression model is plausible in this case.

In [44]:
# create linear model + analyze summary stats
healthcare = health_happiness_filtered[["Healthcare Ranking"]]
happiness = health_happiness_filtered[["Happiness Score"]]

model = LinearRegression().fit(healthcare, happiness)
print("Linear Regression for Healthcare Ranking and Happiness Score: ")
slope = model.coef_[0][0]
print("Regression Slope: ", slope)
intercept = model.intercept_[0]
print("Regression Intercept: ", intercept)
# score() returns the coefficient of determination (R^2), not the correlation coefficient
r_squared = model.score(healthcare, happiness)
print("Coefficient of Determination: ", r_squared)
# in simple linear regression, r is the square root of R^2 with the sign of the slope
r = np.sign(slope) * np.sqrt(r_squared)
print("Correlation Coefficient: ", round(r, 4))
Linear Regression for Healthcare Ranking and Happiness Score: 
Regression Slope:  -0.01995010580686265
Regression Intercept:  7.031031966537375
Coefficient of Determination:  0.3631702647850431
Correlation Coefficient:  -0.6026

Using the model of y = mx+b, the regression line that predicts happiness scores from healthcare rankings is: $$\hat{y} = -0.01995*(x) + 7.03103$$

with y as happiness score and x as healthcare ranking.

  • Based on the linear regression model established between happiness scores and healthcare rankings:
    • The regression slope of -.01995 indicates that for each one-place increase in healthcare ranking (a worse-ranked healthcare system), the predicted happiness score of the country decreases by .01995.
    • The regression intercept of 7.03103 is the predicted happiness score at a healthcare ranking of 0. However, there is no real-life healthcare ranking of 0, since the best ranking is 1, so the intercept is an extrapolation; the country with the best healthcare (ranking 1) is predicted to have a happiness score of about 7.01. This is a limitation of the model that we used. Healthcare rankings are also discrete and bounded, rather than continuous values spanning negative to positive infinity, so we must be cautious with some of the conclusions drawn from the linear regression.
    • The coefficient of determination is the proportion of variation in the dependent variable that can be predicted by the independent variable. It is the value returned by model.score. A value of 0.3632 means that 36.32% of the variation in happiness scores can be predicted by the linear regression modeled with healthcare ranking.
    • The correlation coefficient is the square root of the coefficient of determination, with the sign of the slope. A value of -.60 suggests a moderately strong negative relationship between healthcare rankings and happiness scores.
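The link between the coefficient of determination and the correlation coefficient can be checked numerically. The sketch below uses made-up numbers rather than the project data; the arrays `rank` and `score` are hypothetical stand-ins for healthcare ranking and happiness score, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical data: made-up rankings and scores, not the project dataset
rng = np.random.default_rng(0)
rank = np.arange(1, 51, dtype=float)
score = 7.0 - 0.02 * rank + rng.normal(0, 0.4, size=rank.size)

# Least-squares line: np.polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(rank, score, 1)
pred = slope * rank + intercept

# Coefficient of determination: 1 - SS_residual / SS_total
ss_res = np.sum((score - pred) ** 2)
ss_tot = np.sum((score - score.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient
r = np.corrcoef(rank, score)[0, 1]

# In simple linear regression, R^2 equals r squared and r carries the slope's sign
print("R^2:", r_squared)
print("r:", r)
print("sign(slope) * sqrt(R^2):", np.sign(slope) * np.sqrt(r_squared))
```

scikit-learn's `LinearRegression.score` returns this same R^2 quantity, so taking its square root (with the sign of the slope) recovers the correlation coefficient for a one-predictor model.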
In [45]:
# hypothesis test for happiness score and healthcare ranking
happiness = health_happiness_filtered["Happiness Score"].astype("float")
healthcare = health_happiness_filtered["Healthcare Ranking"].astype("float")

inter = model.intercept_[0]
slope = model.coef_[0][0]

# finding standard error
pred = slope*healthcare + inter
df = len(healthcare)-2
se = standard_error(healthcare, happiness, pred)

t = slope/se # t-statistic
p = stats.t.cdf(t, df=df) # one-sided (lower-tail) p value given t statistic and degrees of freedom; double it for a two-sided test
print("t statistic:", t)
print("standard error:", se)
print("degrees of freedom:", df)
print("p value for significance level .05: ", float(p))
t statistic: -6.583400417945623
standard error: 0.0030303649391400928
degrees of freedom: 76
p value for significance level .05:  2.646221819964587e-09
  • H0: There is no relationship between healthcare ranking and happiness scores
  • H1: There is a relationship between healthcare ranking and happiness scores
  • The printed p value is one-sided (lower tail); doubling it for the two-sided alternative in H1 gives about 5.3 x 10^-9. Since this is less than the significance level of .05, we reject the null hypothesis. There is evidence of a linear relationship between happiness scores and healthcare ranking.
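The test above can be sketched from first principles. The helper `slope_t_test` below is hypothetical (it is not the notebook's `standard_error` function, whose definition is not shown in this section), and the data are made up; it assumes NumPy and SciPy are available and reports a two-sided p value.

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Two-sided t-test for the slope of a simple linear regression.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    df = n - 2
    # Standard error of the slope: sqrt(SSE / df) / sqrt(sum((x - mean(x))^2))
    se = np.sqrt(np.sum(residuals ** 2) / df) / np.sqrt(np.sum((x - x.mean()) ** 2))
    t = slope / se
    # Two-sided p value: probability of a |t| at least this large under H0
    p = 2 * stats.t.sf(abs(t), df)
    return t, se, df, p

# Made-up data, not the project dataset
rng = np.random.default_rng(1)
x = np.arange(1, 41, dtype=float)
y = 7.0 - 0.02 * x + rng.normal(0, 0.4, size=x.size)

t, se, df, p = slope_t_test(x, y)
print("t statistic:", t)
print("standard error:", se)
print("degrees of freedom:", df)
print("two-sided p value:", p)
```

As a cross-check, `scipy.stats.linregress(x, y)` reports the same standard error and two-sided p value for a one-predictor regression.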
Summary

In the original data, which has 78 data points, the scatterplot and the correlation coefficient (about -.60, the signed square root of the R^2 value of .363 returned by model.score) suggest a negative correlation between the two variables. This means that countries with higher healthcare rankings (meaning worse healthcare systems) have lower happiness scores. Because the residuals were random, we were able to create a linear model. The slope of the model was -0.01995 with an intercept of 7.03. This means that for each one-place increase in healthcare ranking, the predicted happiness score decreases by 0.01995. We performed a t-test with the data and found a p value of 2.65 x 10^-9, which is less than our significance level of .05. This suggests that there is strong evidence that the relationship between happiness scores and healthcare rankings is unlikely to be due to random chance.

3. Combine Three Datasets into One Dataframe

In this section, we are combining all three datasets into one dataframe. This will allow us to create a multiple linear regression, as well as look at the relationships between each pair of datasets with a consistent set of countries.

3a. Multiple Linear Regression

In [13]:
### multiple linear regression
combine = pd.DataFrame(healthcare["Country"])
combine.loc[:, "Healthcare Ranking"] = healthcare['healthcareRank']
cases = []
casesh = []
happiness_rank = []
for country in combine["Country"]:
    row = covid_cases[covid_cases["Country,\nOther"]==country]
    cases.append(row["Tot Cases/\n1M pop"].to_string(index=False))
    rowh = happiness[happiness["Country"]==country]
    casesh.append(rowh["Score"].to_string(index=False))
combine.loc[:, "Cases"] = cases
combine.loc[:, "Happiness Score"] = casesh
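combine_filtered = combine[combine["Happiness Score"]!="Series([], )"]
# .copy() gives an independent dataframe, so the later .loc assignments do not act on a slice
combine_filtered_1 = combine_filtered[combine_filtered["Cases"]!="Series([], )"].copy()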

cases_covid = []
cases_filtered = []
happiness_scores = []

for y in combine_filtered_1["Cases"]:
    cases_covid.append(y.replace(' ', ''))

for x in cases_covid:
    cases_filtered.append(x.replace(',',''))

for a in combine_filtered_1["Happiness Score"]:
    happiness_scores.append(a.replace(' ', ''))

combine_filtered_1.loc[:, "Cases"] = cases_filtered
combine_filtered_1.loc[:, "Happiness Score"] = happiness_scores

combine_filtered_1.head()
Out[13]:
     Country  Healthcare Ranking Cases Happiness Score
0     France                   1  2449           6.592
1      Italy                   2  3231           6.223
4      Malta                   5  1015           6.726
5  Singapore                   6  2170           6.262
6      Spain                   7  4786           6.354
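The same join can also be expressed with `DataFrame.merge` instead of a loop and string comparison. The sketch below is an alternative under stated assumptions, not the method used above: the three small frames are hypothetical, "Country X" and its values are made up, and the raw case counts are assumed to arrive as strings with thousands separators.

```python
import pandas as pd

# Hypothetical miniature tables; "Country X" and its values are made up
healthcare_df = pd.DataFrame({
    "Country": ["France", "Italy", "Country X", "Malta"],
    "Healthcare Ranking": [1, 2, 3, 5],
})
cases_df = pd.DataFrame({
    "Country": ["France", "Italy", "Malta", "Country X"],
    "Cases": ["2,449", "3,231", "1,015", "1,234"],
})
happiness_df = pd.DataFrame({
    "Country": ["France", "Italy", "Malta"],
    "Happiness Score": [6.592, 6.223, 6.726],
})

# Strip thousands separators and convert the counts to numbers
cases_df["Cases"] = pd.to_numeric(cases_df["Cases"].str.replace(",", "", regex=False))

# Inner joins keep only the countries present in all three tables
combined = (
    healthcare_df
    .merge(cases_df, on="Country", how="inner")
    .merge(happiness_df, on="Country", how="inner")
)
print(combined)
```

Because the merged columns are already numeric, later cells would not need the repeated `.astype("float")` conversions.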
In [14]:
# scatterplot with all three variables
plt.figure(figsize=(5,5))
plt.ylabel("Coronavirus Cases/1M population")
plt.xlabel("Healthcare Ranking / Happiness Score")
plt.title("Coronavirus Cases by Healthcare Ranking and Happiness Score, by Country")
case_health = plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Cases"].astype("float"))
case_happy = plt.scatter(combine_filtered_1["Happiness Score"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.legend((case_health, case_happy),("Healthcare Ranking", "Happiness Score"))
plt.show()
Out[14]:
In [15]:
#multiple regression coefficients
model_combined = LinearRegression().fit(combine_filtered_1[["Healthcare Ranking", "Happiness Score"]].astype("float"), combine_filtered_1["Cases"].astype("float"))
print("Predicting Coronavirus cases, based on healthcare ranking and happiness score: ")
print("")
print("Regression slope for healthcare ranking: ", (model_combined.coef_)[0])
print("Regression slope for Happiness score: ", (model_combined.coef_)[1])
Predicting Coronavirus cases, based on healthcare ranking and happiness score: 

Regression slope for healthcare ranking:  -18.8605614812662
Regression slope for Happiness score:  409.00488339161524
  • The regression slope of -18.86 suggests that, holding happiness score constant, each one-place increase in a country's healthcare ranking is associated with about 18.86 fewer coronavirus cases per 1M population.
  • The regression slope of 409 suggests that, holding healthcare ranking constant, each one-point increase in a country's happiness score is associated with about 409 more coronavirus cases per 1M population.
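To illustrate what "holding the other variable constant" means for these slopes, the sketch below fits a two-predictor model to made-up data (not the project dataset) with NumPy's least-squares solver. The names `ranking`, `happiness`, `cases`, and `predict` are hypothetical.

```python
import numpy as np

# Made-up data, not the project dataset
rng = np.random.default_rng(2)
n = 60
ranking = rng.uniform(1, 100, size=n)
happiness = 7.0 - 0.02 * ranking + rng.normal(0, 0.5, size=n)
cases = 500 - 10 * ranking + 300 * happiness + rng.normal(0, 200, size=n)

# Design matrix with an intercept column, solved by ordinary least squares
X = np.column_stack([np.ones(n), ranking, happiness])
coef, *_ = np.linalg.lstsq(X, cases, rcond=None)
intercept, b_ranking, b_happiness = coef

def predict(rank_value, happiness_value):
    """Predicted cases for one country from the fitted coefficients."""
    return intercept + b_ranking * rank_value + b_happiness * happiness_value

# Raising the ranking by one place while keeping happiness fixed changes
# the prediction by exactly the ranking coefficient
change = predict(21, 6.0) - predict(20, 6.0)
print("ranking coefficient:", b_ranking)
print("change in prediction:", change)

# The two predictors are themselves correlated, which makes each slope less stable
print("corr(ranking, happiness):", np.corrcoef(ranking, happiness)[0, 1])
```

Since healthcare ranking and happiness score are correlated in the project data as well, the individual slopes should be read with some caution.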

3b. Comparing Happiness and Corona Cases Using Joined Data

In [29]:
#scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Happiness Score"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.title("Coronavirus Cases and Happiness Scores")
plt.ylabel("Coronavirus Cases/1M population")
plt.xlabel("Happiness Score")
plt.show()
Out[29]:

Based on the scatterplot, there appears to be a positive relationship between coronavirus cases and happiness scores.

In [30]:
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=combine_filtered_1["Happiness Score"].astype("float"), y=combine_filtered_1["Cases"].astype("float"), lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Coronavirus Cases")
plt.show()
Out[30]:

Similar to the original data, there appears to be some pattern to the residuals, which indicates that a linear model may not be appropriate.

In [31]:
#additional values 

#correlation coefficient
happy_array = np.array(combine_filtered_1["Happiness Score"].astype(float))
case_array = np.array(combine_filtered_1["Cases"].astype(float))
correlation = np.corrcoef(happy_array, case_array)[0,1]
print("Correlation Coefficient: ", correlation)
Correlation Coefficient:  0.5225223741112006

A correlation coefficient of 0.52 suggests that there is a moderate positive relationship between coronavirus cases and happiness scores, which reflects what was found in the original data.

Summary

For the joined data, the number of countries was reduced to 77. From this merged data (with all three variables present), the scatterplot and the correlation coefficient (+0.52) suggest similar findings to the original data. While there may be a slight discrepancy in the values, higher happiness scores are still associated with more coronavirus cases.

Happiness Scores and Log of Coronavirus Cases

In [32]:
#scatterplot 
cases_comb = combine_filtered_1["Cases"].astype("float")
cases_comb = np.log(cases_comb)
happy_comb = combine_filtered_1["Happiness Score"].astype(float)

plt.figure(figsize=(5,5))
plt.scatter(happy_comb, cases_comb)
plt.xlabel("Happiness Score")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Happiness Score compared to the Log of Coronavirus Cases")
plt.show()
Out[32]:

Based on the scatterplot, there appears to be a relatively strong positive relationship between happiness score and the log of coronavirus cases. Additionally, the trend appears more linear than the exponential trend seen before taking the log of coronavirus cases.
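The effect of the log transform on an exponential trend can be reproduced with synthetic data. The sketch below uses made-up values (not the project dataset) in which cases grow exponentially with happiness score, and compares the correlation before and after taking the log.

```python
import numpy as np

# Made-up data with exponential growth and multiplicative noise, not the project dataset
rng = np.random.default_rng(3)
happiness = rng.uniform(3, 8, size=80)
cases = np.exp(1.2 * happiness - 1.3 + rng.normal(0, 0.3, size=happiness.size))

# Pearson correlation measures linear association, so it rises once the
# exponential curve has been straightened by the log
r_raw = np.corrcoef(happiness, cases)[0, 1]
r_log = np.corrcoef(happiness, np.log(cases))[0, 1]

print("correlation with raw cases:", r_raw)
print("correlation with log cases:", r_log)
```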

In [33]:
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=happy_comb, y=cases_comb, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Log of Coronavirus Cases")
plt.show()
Out[33]:

There appears to be no pattern to the residual plot, which indicates that a linear model may be appropriate.

In [34]:
#linear regression
happy_resize = combine_filtered_1[["Happiness Score"]].astype(float)
log_model = LinearRegression().fit(happy_resize, cases_comb)
print("Linear Regression for Happiness Score and Log of Coronavirus Cases: ")
print("Regression Slope: ", log_model.coef_[0])
print("Regression Intercept: ", log_model.intercept_)
print("Coefficient of determination: ", log_model.score(happy_resize, cases_comb))
Linear Regression for Happiness Score and Log of Coronavirus Cases: 
Regression Slope:  1.1748287720344996
Regression Intercept:  -1.299543384194882
Coefficient of determination:  0.35506233376623897
  • Based on the linear regression model established between happiness score and the log of coronavirus cases:
    • The regression slope of 1.17 is on the log scale: for each one-point increase in happiness score, the predicted log of coronavirus cases increases by 1.17, which means predicted cases (per 1M population) are multiplied by about e^1.1748 ≈ 3.24.
    • The regression intercept of -1.30 is the predicted log of cases at a happiness score of 0, which corresponds to about e^-1.30 ≈ 0.27 cases per 1M population rather than a negative number of cases. No country in the data has a happiness score near 0, so this value is an extrapolation and should not be interpreted literally.
    • The coefficient of determination is the proportion of variation in the dependent variable that can be predicted by the independent variable. A value of 0.355 means that 35.5% of the variation in the log of coronavirus cases can be predicted by the linear regression modeled with happiness score.

Using the model of y = mx+b, the regression line that predicts coronavirus cases (log) from happiness scores is: $$\hat{y} = 1.17*(x) - 1.30$$

with y as log of coronavirus cases and x as happiness score.
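Because the model predicts the natural log of cases, its predictions have to be exponentiated to return to cases per 1M population. The sketch below uses the slope and intercept printed above; the function name `predicted_cases` is hypothetical.

```python
import math

# Slope and intercept as printed by the notebook for the log model
slope = 1.1748287720344996
intercept = -1.299543384194882

# A one-point increase in happiness score multiplies predicted cases by exp(slope)
factor_per_point = math.exp(slope)

def predicted_cases(happiness_score):
    """Predicted cases per 1M population, back-transformed from the log scale."""
    return math.exp(slope * happiness_score + intercept)

print("multiplicative change per happiness point:", round(factor_per_point, 2))
print("predicted cases/1M at a score of 6.0:", round(predicted_cases(6.0)))
print("predicted cases/1M at a score of 7.0:", round(predicted_cases(7.0)))
```

The ratio of the two predictions equals `factor_per_point`, which is why a slope on the log scale reads as a multiplicative change rather than an additive one.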

In [35]:
#hypothesis test
m = log_model.coef_[0]
b = log_model.intercept_
pred = m*happy_array + b
se = standard_error(happy_array, cases_comb, pred)

#t-statistic
t = m/se
df = len(happy_array)-2

p = 1-stats.t.cdf(t,df=df) # one-sided (upper-tail) p-value; double it for a two-sided test

#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
The t-statistic:  6.425753515116964
The p-value:  5.419756421431998e-09

H0: There is no relationship between happiness scores and log of coronavirus cases

H1: There is a relationship between happiness scores and log of coronavirus cases

The printed p-value is one-sided (upper tail); doubling it for the two-sided alternative in H1 gives about 1.1 x 10^-8. Since this p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between happiness scores and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.
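As a consistency check, the reported t-statistic can be converted to both a two-sided p-value and the implied correlation coefficient. The sketch below takes the printed t-statistic as input and assumes 75 degrees of freedom (77 joined countries minus 2 estimated parameters); it relies on `scipy.stats.t.sf`, the upper-tail probability.

```python
import math
from scipy import stats

# t-statistic as printed by the notebook; df = 75 is an assumption based on
# the 77 countries in the joined data
t = 6.425753515116964
df = 75

p_one_sided = stats.t.sf(t, df)            # upper tail, same quantity as 1 - cdf
p_two_sided = 2 * stats.t.sf(abs(t), df)   # matches a two-sided H1

# In simple linear regression, t and the correlation coefficient are linked by
# r = t / sqrt(t^2 + df)
r = t / math.sqrt(t ** 2 + df)

print("one-sided p:", p_one_sided)
print("two-sided p:", p_two_sided)
print("implied correlation:", round(r, 3))
```

The implied correlation agrees with the value computed separately with `np.corrcoef`, which suggests the regression and the test describe the same fit.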

In [19]:
#additional values 

#correlation coefficient
happy = np.array(combine_filtered_1["Happiness Score"].astype(float))
correlation = np.corrcoef(happy, cases_comb)[0,1]
print("Correlation Coefficient: ", correlation)
Correlation Coefficient:  0.5958710714292471

A correlation coefficient of 0.6 suggests a moderately strong positive relationship between happiness score and the log of coronavirus cases.

Summary

Similar to the original data (section 2b), taking the log of coronavirus cases and comparing that to the happiness scores yielded a stronger linear correlation (+.6). This suggests that, before taking the log, coronavirus cases increase roughly exponentially with happiness score. Compared to the original data, there was a smaller coefficient of determination (.355), which means that less of the variation in the log of coronavirus cases could be predicted by the model based on happiness scores; however, taking the log of coronavirus cases still yielded a more linear relationship. Furthermore, in the hypothesis test, the p-value was once again less than the significance level of 0.05, which suggests that the relationship between happiness score and coronavirus cases is statistically significant and not likely the consequence of random chance. These findings echo those from the original data set, which suggests that the results changed little even though the number of countries changed.

In General:

Regarding happiness score and coronavirus cases, the positive correlation between the two variables suggests that countries with higher happiness scores (with societal factors including quality of life, economic status of the country, life expectancy, social support, etc.) have more reported cases. One possible explanation is that countries with higher economic status have more resources to test for coronavirus. Similarly, countries with more social support and a higher quality of life may encourage or require more people to get tested, because people are more aware that the disease is present in the country, which may be why such countries report higher counts of coronavirus cases.

3c. Comparing Healthcare Ranking and Corona Cases Using Joined Data

In [16]:
#scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.title("Coronavirus Cases and Healthcare ranking")
plt.ylabel("Coronavirus Cases/1M population")
plt.xlabel("Healthcare Ranking")
plt.show()
Out[16]:

Based on the scatterplot, there is a negative relationship between healthcare ranking and coronavirus cases.

In [17]:
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=combine_filtered_1["Healthcare Ranking"].astype("float"), y=combine_filtered_1["Cases"].astype("float"), lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare vs. Coronavirus Cases")
plt.show()
Out[17]:

Like the original data, the residuals show a non-random distribution, suggesting that a linear model is not appropriate.

In [18]:
#correlation coefficient
health_array = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
case_array = np.array(combine_filtered_1["Cases"].astype(float))
correlation = np.corrcoef(health_array, case_array)[0,1]
print("Correlation Coefficient: ", correlation)
Correlation Coefficient:  -0.5670851216335175

This correlation coefficient shows a moderately strong negative correlation, which is the same as found in the original data.

In [20]:
#scatterplot 
caseslog = np.log(combine_filtered_1["Cases"].astype("float"))

health = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))

plt.figure(figsize=(5,5))
plt.scatter(health, caseslog)
plt.xlabel("healthcare ranking")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Healthcare ranking compared to the Log of Coronavirus Cases")
plt.show()
Out[20]:

This scatterplot shows a negative relationship, as with the original data set. It looks more linear than before the log was taken, when it looked more exponential.

In [21]:
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=health, y=caseslog, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('healthcare ranking')
plt.title("Residuals for Healthcare ranking vs. Log of Coronavirus Cases")
plt.show()
Out[21]:

As with the original data, the residual plot is random once the log of coronavirus cases is taken. This means a linear model may be appropriate.

In [23]:
#linear regression
health2 = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
caseslog1 = np.array(np.log(combine_filtered_1["Cases"].astype("float")))
newhealth = combine_filtered_1[["Healthcare Ranking"]].astype(float)
log_model = LinearRegression().fit(newhealth, caseslog)
print("Linear Regression for Healthcare and Log of Coronavirus Cases: ")
print("Regression Slope: ", log_model.coef_[0])
print("Regression Intercept: ", log_model.intercept_)
print("Coefficient of determination: ", log_model.score(newhealth, caseslog))
correlation = np.corrcoef(health2, caseslog1)[0,1]
print("Correlation Coefficient: ", correlation)
Linear Regression for Healthcare and Log of Coronavirus Cases: 
Regression Slope:  -0.03871461269771666
Regression Intercept:  7.6769572349858075
Coefficient of determination:  0.3551613429734014
Correlation Coefficient:  -0.5959541450257742
$$\hat{y} = -0.0387*(x) + 7.6770$$
  • where y is log of coronavirus cases and x is the healthcare ranking of a country
  • Based on the linear regression model established between healthcare ranking and the log of coronavirus cases:
    • The regression slope of -.0387 is on the log scale: for each one-place increase in healthcare ranking, the predicted log of coronavirus cases decreases by .0387, which means predicted cases (per 1M population) are multiplied by about e^-0.0387 ≈ 0.962, or about 3.8% fewer cases.
    • The regression intercept of 7.677 is hard to interpret. It is the predicted log of cases at a healthcare ranking of 0 (about e^7.677 ≈ 2,160 cases per 1M population), but there is no healthcare ranking of zero, which is a limitation of our model. In addition, the regression treats the predictor as continuous, while healthcare rankings are discrete and only go up to one hundred.
    • The coefficient of determination is the proportion of variation in the dependent variable that can be predicted by the independent variable. A value of 0.355 means that 35.5% of the variation in the log of coronavirus cases can be predicted by the linear regression modeled with healthcare ranking.
    • The correlation coefficient of -.596 suggests a moderately strong negative relationship.
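For a negative slope on the log scale, it is often easier to read the effect as a percent change in predicted cases. The sketch below uses the slope printed above; the helper `percent_change` is hypothetical.

```python
import math

# Slope as printed by the notebook for log(cases) regressed on healthcare ranking
slope = -0.03871461269771666

def percent_change(rank_steps):
    """Percent change in predicted cases when the ranking rises by rank_steps places."""
    return (math.exp(slope * rank_steps) - 1) * 100

print("one place worse:", round(percent_change(1), 1), "%")
print("ten places worse:", round(percent_change(10), 1), "%")
```

The percent change compounds rather than adds: ten places correspond to roughly a third fewer predicted cases, not ten times the one-place figure.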
In [65]:
health2 = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
caseslog1 = np.array(np.log(combine_filtered_1["Cases"].astype("float")))

#hypothesis test
m = log_model.coef_[0]
b = log_model.intercept_
pred = m*health2 + b
se = standard_error(health2, caseslog1, pred)

#t-statistic
t = m/se
df = len(health2)-2

p = stats.t.cdf(t,df=df) # one-sided (lower-tail) p-value; double it for a two-sided test

#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
The t-statistic:  -6.427142722357519
The p-value:  5.3879410432303136e-09

H0: There is no relationship between healthcare ranking and log of coronavirus cases

H1: There is a relationship between healthcare ranking and log of coronavirus cases

The printed p-value is one-sided (lower tail); doubling it for the two-sided alternative in H1 gives about 1.1 x 10^-8. Since this p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between healthcare ranking and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.

In General:
  • In comparing the original data and the joined data, although some of the values changed due to the difference in length and contents of the data set, the relationships established and the general trends of the data remained the same.

  • In both data sets, it appears that there is a moderate to strong negative relationship between healthcare ranking and coronavirus cases.

Summary

The correlation changed from -.50 with the original data (section 2a) to -.57 with just the joined data, which may suggest that some outliers were removed when the data was joined, making the correlation stronger. The residual plot for this data was once again not random, so we did not do a t-test on it. When we took the log of coronavirus cases using the joined data, we found that the residual plot was more random, so we did a linear regression and hypothesis test on that. The slope changed from -.0385 to -0.0387, which is a very small change. We found the p value to be 5.4 x 10^-9, which once again suggests there is a linear relationship between the two variables. Overall, using the joined data did not change our findings significantly.

General Context

Overall, our findings are that countries with higher healthcare ranks (meaning a worse healthcare system) have fewer reported coronavirus cases. This could be because countries with worse healthcare systems have less ability to test people, so they are currently reporting fewer cases. Even in countries with lower ranks (better healthcare systems), the number of confirmed cases often falls significantly short of the actual number of cases. For example, in some places where tests are in short supply, people are only tested for the virus if the result would change their treatment; otherwise, the symptoms are simply treated as they present themselves. This is likely leading to an undercount of cases.

3d. Comparing Healthcare Rankings and Happiness Scores Using Joined Data

In [34]:
# scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Happiness Score"].astype("float"))
plt.title("Healthcare Ranking and Happiness Scores")
plt.ylabel("Happiness Score")
plt.xlabel("Healthcare Ranking")
plt.show()
Out[34]:

This scatterplot is very similar to the one created with the original data set. There appears to be a moderate negative relationship between healthcare rankings and happiness scores.

In [36]:
# residual plot
# x is healthcare ranking and y is happiness score, matching the axis label and the regression
health_array = np.array(combine_filtered_1["Healthcare Ranking"].astype("float"))
happy_array = np.array(combine_filtered_1["Happiness Score"].astype("float"))
plt.figure(figsize=(5,5))
sns.residplot(x=health_array, y=happy_array, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare Ranking vs. Happiness Score")
plt.show()
Out[36]:

The residual plot shows no pattern, demonstrating that a linear regression model is plausible in this case.

In [43]:
#linear regression
happiness = combine_filtered_1[["Happiness Score"]].astype("float")
healthcare = combine_filtered_1[["Healthcare Ranking"]].astype("float")
happ_model = LinearRegression().fit(healthcare, happiness)
slope = happ_model.coef_[0][0]
# score() returns the coefficient of determination (R^2), not the correlation coefficient
r_squared = happ_model.score(healthcare, happiness)
# in simple linear regression, r is the square root of R^2 with the sign of the slope
r = np.sign(slope) * np.sqrt(r_squared)
print("Linear Regression for Happiness Score and Healthcare Ranking: ")
print("Regression Slope:", slope)
print("Regression Intercept:", happ_model.intercept_[0])
print("Coefficient of determination:", r_squared)
print("Correlation coefficient:", round(r, 4))
Linear Regression for Happiness Score and Healthcare Ranking: 
Regression Slope: -0.01995459227932001
Regression Intercept: 7.021134463183175
Coefficient of determination: 0.36678039466053325
Correlation coefficient: -0.6056
  • Based on the linear regression model established between happiness scores and healthcare rankings:
    • The regression slope of -.01995 indicates that for each one-place increase in healthcare ranking (a worse-ranked healthcare system), the predicted happiness score of the country decreases by .01995.
    • The regression intercept of 7.0211 is the predicted happiness score at a healthcare ranking of 0. As mentioned in section 2c, there is no real-life healthcare ranking of 0, so the intercept is an extrapolation; the country with the best healthcare (ranking 1) is predicted to have a happiness score of about 7.00. We must therefore be cautious with some of the conclusions drawn from the linear regression.
    • The coefficient of determination is the proportion of variation in the dependent variable that can be predicted by the independent variable. It is the value returned by happ_model.score. A value of 0.3668 means that 36.68% of the variation in happiness scores can be predicted by the linear regression modeled with healthcare ranking.
    • The correlation coefficient is the square root of the coefficient of determination, with the sign of the slope. A value of -.606 suggests a moderately strong negative relationship between healthcare rankings and happiness scores.

Using the model of y = mx+b, the regression line that predicts happiness scores from healthcare rankings is: $$\hat{y} = -0.01995*(x) + 7.021$$

with y as happiness score and x as healthcare ranking.

In [46]:
# hypothesis test for happiness score and healthcare ranking
happiness = combine_filtered_1["Happiness Score"].astype("float")
healthcare = combine_filtered_1["Healthcare Ranking"].astype("float")

inter = happ_model.intercept_[0]
slope = happ_model.coef_[0][0]

# finding standard error
pred = slope*healthcare + inter
df = len(healthcare)-2
se = standard_error(healthcare, happiness, pred)

t = slope/se # t-statistic
p = stats.t.cdf(t, df=df) # one-sided (lower-tail) p value given t statistic and degrees of freedom; double it for a two-sided test
print("t statistic:", t)
print("standard error:", se)
print("degrees of freedom:", df)
print("p value for significance level .05: ", float(p))
t statistic: -6.591078923850045
standard error: 0.0030275152990679017
degrees of freedom: 75
p value for significance level .05:  2.683863619300173e-09
  • H0: There is no relationship between healthcare ranking and happiness scores
  • H1: There is a relationship between healthcare ranking and happiness scores
  • The printed p value is one-sided (lower tail); doubling it for the two-sided alternative in H1 gives about 5.4 x 10^-9. Since this is less than the significance level of .05, we reject the null hypothesis. There is evidence of a linear relationship between happiness scores and healthcare ranking.
Summary

The joined data had a length of 77, meaning that only one datapoint was removed from the original dataset. Therefore, the findings from the joined dataset were roughly the same as in the original (section 2c).

General Context

Overall, we see that countries with higher healthcare rankings (meaning a worse healthcare system) have lower happiness scores. This makes sense because a “good” healthcare system may improve the quality of life in a country, resulting in higher happiness scores in countries with better healthcare systems. Countries with higher healthcare rankings (worse healthcare systems) also tend to be small island countries. Five of the 10 countries with the worst healthcare rankings are small island countries, while only one of the top 10 countries falls into this category. It would be interesting to do further research to see if the size, location, and average temperature of countries have any correlation with healthcare rankings and happiness scores.

4. Conclusion

General Summary

Overall, we saw that both healthcare rankings (government institutions) and happiness scores (societal factors) have a relatively strong relationship with the number of coronavirus cases per country. Healthcare rankings have a negative relationship with the number of coronavirus cases. When just examining these two variables, this is counterintuitive because one would expect countries with better healthcare to have fewer coronavirus cases; external factors, such as testing capacity, may be at play. Likewise, when we look at happiness scores and coronavirus cases, we see that countries with higher happiness scores have more coronavirus cases, which is also the opposite of what we might expect; once again, external factors may be at play. However, since we saw that there is a negative relationship between happiness scores and healthcare ranking (i.e. countries with better healthcare had higher happiness scores), it makes sense that the two variables show consistent trends with coronavirus cases per country.

The Bigger Picture

Within the wider context of real-life application, elements such as happiness scores and healthcare rankings are influenced by a variety of factors, including income, quality of life, freedom, etc. Additionally, such societal factors can directly influence the number of coronavirus cases, as income and quality of life can determine how accessible and how feasible social distancing is for people. Likewise, it makes sense that general healthcare rankings are correlated with coronavirus cases, as countries with better healthcare rankings probably place more emphasis on treatment and guidelines for their citizens. In our data analysis, we analyze the bigger picture, using countries as a whole; however, countries have certain “hotspots”, where people may be gathered and/or cases may be more frequently reported based on the resources that are available. The uneven distribution of wealth and access to healthcare across regions may explain disparities within countries that cannot be seen from country-level numbers alone. In more developed countries, where resources are more available, the numbers reported may be higher because there is more access to tests and to medical care. This means that our findings when it comes to healthcare and coronavirus cases may prove to be reversed by the end of the pandemic.

Considerations for Data

Since we did all of our data collection and analysis in the middle of the pandemic, the numbers we collected for coronavirus cases are not the final number of cases in each country. Although we found some preliminary evidence, in order to draw stronger conclusions, it would be useful to re-run the tests with the final numbers once the pandemic is over. Furthermore, due to the widespread lack of testing in some places, some countries may be underrepresented in the coronavirus cases data. Additionally, the concept of “heteroscedasticity” applies to many of the comparisons in our dataset. In the instances where we took the log of coronavirus cases, the data had small variances concentrated in one area and large variances concentrated in another, and the relationships looked exponential rather than linear. Since our data had heteroscedasticity, we decided to analyze the independent variables with respect to the log of the dependent variable to determine what relationships and correlations existed among the data.
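A rough numerical check for heteroscedasticity is to compare how widely the residuals spread in the upper and lower halves of the predictor. The sketch below is an informal illustration on made-up data (not the project dataset), not a formal test; the helper `spread_ratio` is hypothetical.

```python
import numpy as np

def spread_ratio(x, y):
    """Ratio of residual spread (upper half of x vs. lower half) after a straight-line fit.

    Hypothetical helper: values far from 1 hint at heteroscedasticity.
    """
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    upper = x > np.median(x)
    return residuals[upper].std() / residuals[~upper].std()

# Made-up data with multiplicative noise, not the project dataset
rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = np.exp(0.5 * x + rng.normal(0, 0.3, size=x.size))

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, np.log(y))
print("spread ratio, raw values:", ratio_raw)
print("spread ratio, log values:", ratio_log)
```

On the raw scale the residuals fan out as the predictor grows, while on the log scale the spread is roughly even, which is the behavior that motivated modeling the log of coronavirus cases.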

Prior Things Done

With regard to the coronavirus data set, we tried to web scrape the table with all the cases from the site; however, this proved to be difficult because the HTML formatting of the site caused the scraped case counts to be matched with the wrong countries. Therefore, we chose a different method of obtaining the data instead.

Further Considerations

In further studies, we could look at cell phone data to see how well people are following social distancing rules and compare that with the number of cases in the country. Additionally, we could look for confounding variables shared by government institutions and societal factors, which could help determine whether one has more of an effect on the number of coronavirus cases. Furthermore, some other considerations include age distribution and population density for countries. Within our data and given the coronavirus information provided to the public, it is suggested that people in higher age demographics are more susceptible to the illness and that countries with higher population densities have better-equipped healthcare systems. Such factors can perhaps also explain and affect the number of coronavirus cases for a specific country.

5. Acknowledgements:

We would like to thank Professor Mimno for giving us feedback throughout the process and looking at different iterations of our projects. Our TAs were also very helpful in answering questions that came up. In addition, https://stattrek.com/regression/slope-test.aspx was a helpful resource in figuring out how to run our linear regression hypothesis tests.