COVID-19, also known colloquially as the coronavirus, is an ongoing pandemic that has reached almost every part of the world, and is affecting all of our lives in some way. Different countries have been handling the pandemic in different ways, ranging from complete lockdowns (China), to letting the virus run its course (Sweden). In addition to handling the virus in varied ways, countries also have a wide range of healthcare systems. We were curious about what effect different factors from the government and society have on coronavirus cases.
In an attempt to measure these effects, we looked at happiness scores, healthcare rankings, and coronavirus cases by country. Our data show that both government and societal factors of a country are related to the number of coronavirus cases. Specifically, we found a significant correlation between happiness scores and coronavirus cases, as well as between healthcare rankings and coronavirus cases. These findings are important because they show that more than one aspect of a country plays a role in the coronavirus pandemic. In the broader context of global health efforts, it is crucial to examine various components and to consider that external factors other than government intervention may contribute to a specific event.
To look at the government aspects, we looked at the relationship between the healthcare of a country and the number of cases of coronavirus in that country. The healthcare rankings were based on a study by the World Health Organization. The factors include: care process (preventative care measures, safe care, coordinated care, and engagement and patient preferences), access (affordability and timeliness), administrative efficiency, equity, and healthcare outcomes (population health, mortality amenable to healthcare, and disease-specific health outcomes).
To look at the societal factors, we looked at the relationship between the happiness scores of a country and the number of cases of coronavirus in that country. The happiness scores and rankings data came from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative.
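As a rough illustration of how weighted ladder answers turn into a single national score, the sketch below computes a weighted mean. The responses and weights are made-up values and `weighted_ladder_score` is a hypothetical helper; the actual Gallup weighting procedure is more involved and is not reproduced here.

```python
def weighted_ladder_score(responses, weights):
    """Weighted mean of Cantril-ladder answers (0 = worst life, 10 = best life)."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r * w for r, w in zip(responses, weights)) / total_weight


# Hypothetical respondents: ladder answers and their survey weights.
example_responses = [3, 7, 8, 5]
example_weights = [1.0, 2.0, 0.5, 1.5]
example_score = weighted_ladder_score(example_responses, example_weights)
print(round(example_score, 2))
```

Respondents with larger weights pull the national score toward their answer, which is how a sample can be made to resemble the population it was drawn from.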
Overall, we were trying to determine whether the healthcare system or the happiness level of a country truly affects what happens during a pandemic, or whether an external variable, such as how society handles the pandemic, is more of an indicator. To see what effect happiness scores and healthcare rankings had on coronavirus cases, we fit a linear regression for each pair of variables, as well as a multiple linear regression. Where appropriate, we performed t-tests to check for significance (i.e., to check that the relationships established are unlikely to be due to random chance) and used residual plots to assess the linearity of the data set(s).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import math
import seaborn as sns
healthcare = pd.read_csv("healthcare.csv")
healthcare.head()
happiness = pd.read_csv("2019happiness.csv")
happiness.head()
covid_cases = pd.read_csv("COVID-19Cases.csv")
covid_cases.head()
# calculates the standard error of the regression slope from the residuals
def standard_error(x, y, pred):
    x_mean = x.mean()
    n = len(x)
    sum_x = np.sum((x - x_mean)**2)   # spread of x around its mean
    sum_y = np.sum((y - pred)**2)     # residual sum of squares
    se = (sum_y / ((n - 2) * sum_x))**(1/2)
    return se
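For reference, the quantity computed by `standard_error` matches the usual standard error of the slope in a simple linear regression, and the test statistic used in the hypothesis tests is the slope divided by it. Written out, with $m$ the fitted slope, $\hat{y}_i$ the model predictions (`pred`), and $n$ the number of countries:

$$SE_{m} = \sqrt{\frac{\sum_{i}(y_i - \hat{y}_i)^2}{(n-2)\sum_{i}(x_i - \bar{x})^2}}, \qquad t = \frac{m}{SE_{m}}, \qquad df = n - 2$$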
In section 2, we look at the relationship between each pair of datasets. This gives us a better idea of how these variables relate in a larger sample before we combine all three factors into one dataframe for further analysis.
# making data frame
covid_health = pd.DataFrame(healthcare["Country"])
covid_health["Healthcare ranking"] = healthcare["healthcareRank"]
covidcases = []
for x in covid_health["Country"]:
    cases = covid_cases[covid_cases["Country,\nOther"]==x]
    covidcases.append(cases["Tot Cases/\n1M pop"].to_string(index=False))
covid_health["Cases Per Million"] = covidcases
# countries missing from the case table show up as the string of an empty Series
new_covid_health = covid_health[covid_health["Cases Per Million"]!="Series([], )"].copy()
## making all cases into floats
cases = []
for y in new_covid_health["Cases Per Million"]:
    cases.append(y.replace(',',''))
new_covid_health["Cases Per Million"] = cases
# assign to the column (indexing .loc with only the column name would add a row instead)
new_covid_health["Cases Per Million"] = new_covid_health["Cases Per Million"].astype('float')
## making all rankings into floats
new_covid_health["Healthcare ranking"] = new_covid_health["Healthcare ranking"].astype('float')
## taking out NaN data
new_covid_health = new_covid_health.dropna(subset = ["Cases Per Million"], inplace = False)
new_covid_health.head()
## making scatterplot
plt.figure(figsize = (5,5))
plt.scatter(new_covid_health['Healthcare ranking'].astype("float"),new_covid_health['Cases Per Million'].astype("float"))
plt.xlim(0,120)
plt.ylim(0,7000)
plt.ylabel('cases per million')
plt.xlabel('healthcare ranking')
plt.title("Healthcare ranking and coronavirus cases per million ")
plt.yticks()
plt.show()
Based on the scatter plot, there appears to be a negative correlation between healthcare ranking and coronavirus cases: as the ranking number gets higher (i.e., a worse-ranked healthcare system), the country sees fewer cases.
xarray = np.array(new_covid_health["Healthcare ranking"].astype("float")) #healthcare ranking
yarray = np.array(new_covid_health["Cases Per Million"].astype("float")) #cases
plt.figure(figsize = (5,5))
sns.residplot(x=xarray, y=yarray, lowess=False, color="b")
plt.title("Residual plot for corona cases and healthcare ranking")
plt.show()
There appears to be some pattern to the residual plot, which suggests that a linear regression is not likely to be appropriate.
#correlation
correlation = np.corrcoef(xarray, yarray)[0,1]
print("Correlation coefficient: ", correlation)
print("Coefficient of determination: ", correlation**2)
A correlation coefficient of -.5 suggests a moderately strong negative relationship between the two variables. The coefficient of determination is .25, meaning 25% of the variability in cases can be explained by the model. However, since the residuals are not random, we decided to take the log of the coronavirus cases to make a linear model. This is appropriate because cases are increasing exponentially.
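The reason a log transform can help is that an exponential relationship $y = A e^{kx}$ becomes a straight line after taking logs, $\ln y = \ln A + kx$. The sketch below illustrates this on synthetic, noise-free values (not the notebook's data); `pearson_r` is a hand-written helper and the constants 5000 and -0.05 are arbitrary assumptions chosen only for the demonstration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic exponential decay: y = 5000 * exp(-0.05 * x)
toy_rank = list(range(1, 101))
toy_cases = [5000 * math.exp(-0.05 * x) for x in toy_rank]
toy_log_cases = [math.log(y) for y in toy_cases]

r_raw = pearson_r(toy_rank, toy_cases)      # curved relationship, |r| below 1
r_log = pearson_r(toy_rank, toy_log_cases)  # straight line after the log
print(r_raw, r_log)
```

With real data the improvement is less clean because of noise, which is why the residual plot is still worth checking after the transformation.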
#scatterplot
yarraylog = np.log(yarray) #taking the log of cases
plt.figure(figsize=(5,5))
plt.scatter(xarray, yarraylog)
plt.xlim(0,120)
plt.ylim(0,30)
plt.xlabel("Healthcare ranking Score")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Healthcare Ranking compared to the Log of Coronavirus Cases")
plt.show()
Based on the scatterplot, it appears that there is a relatively strong negative relationship between healthcare ranking and the log of coronavirus cases, and that a linear relationship describes this data better than it describes the original.
plt.figure(figsize = (5,5))
sns.residplot(x=xarray, y=yarraylog, lowess=False, color="b")
plt.title("Residual plot for log of corona cases and healthcare ranking")
plt.show()
This residual plot is much more random than the one for the original data, suggesting a linear model is more appropriate.
#linear regression
newmodel = LinearRegression().fit(new_covid_health[['Healthcare ranking']].astype("float"),np.log(new_covid_health[['Cases Per Million']].astype("float")))
print("Linear Regression for Healthcare ranking and Log of Coronavirus Cases: ")
print("Regression Slope: ", newmodel.coef_[0])
print("Regression Intercept: ", newmodel.intercept_)
print("Coefficient of determination: ", newmodel.score(new_covid_health[['Healthcare ranking']].astype("float"),np.log(new_covid_health[['Cases Per Million']].astype("float"))))
health = np.array(new_covid_health['Healthcare ranking'].astype("float"))
correlation = np.corrcoef(health, yarraylog)[0,1]
print("Correlation Coefficient: ", correlation)
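#hypothesis test for log corona and healthcare ranking
# coef_ and intercept_ are arrays here because y was passed as a DataFrame
slope = newmodel.coef_[0]
intercept = newmodel.intercept_
m = float(np.ravel(slope)[0])
b = float(np.ravel(intercept)[0])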
df = len(xarray)-2
## finding standard error
preds = m*xarray +b
se = standard_error(xarray,yarraylog,preds)
t = m/se # test statistic
p = stats.t.cdf(t,df=df)
print("t statistic: ", t)
print("standard error: ", se)
print("degrees of freedom: ", df)
print("p value (significance level .05): ", p)
In the original data, which has 94 data points, the scatterplot and correlation coefficient (-.50) suggest a negative correlation between the two variables. This means that countries with higher healthcare rankings, meaning worse healthcare systems, have fewer coronavirus cases. The residual plot for the original data appeared to have some sort of pattern and didn't appear to be random, so we did not perform a hypothesis test on it. We decided to take the log of the coronavirus cases because the cases are increasing exponentially in some places. After taking the log of the coronavirus cases, the residuals appeared much more random. We performed a t-test on the slope of the regression of log coronavirus cases on healthcare ranking and found a p value of $1.17 \times 10^{-11}$, which is less than our significance level of .05. This provides strong evidence that the relationship between coronavirus cases and healthcare ranking is not due to random chance.
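One way to sanity-check a hand-computed slope test is to compare it against `scipy.stats.linregress`, which reports the slope, its standard error, and a two-sided p-value. The sketch below does this on synthetic data standing in for (healthcare ranking, log cases); the data, the seed, and the helper `slope_standard_error` are assumptions made for illustration, not the notebook's values.

```python
import numpy as np
from scipy import stats


def slope_standard_error(x, y):
    """Slope, its standard error, and the t statistic for a simple linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    residuals = y - (slope * x + intercept)
    se = np.sqrt(np.sum(residuals ** 2) / ((len(x) - 2) * np.sum((x - x.mean()) ** 2)))
    return slope, se, slope / se


# Synthetic stand-in with 94 points; not the real data.
rng = np.random.default_rng(0)
toy_x = np.arange(1, 95, dtype=float)
toy_y = 8.0 - 0.04 * toy_x + rng.normal(scale=1.0, size=toy_x.size)

manual_slope, manual_se, manual_t = slope_standard_error(toy_x, toy_y)
check = stats.linregress(toy_x, toy_y)
print(manual_slope, check.slope)
print(manual_se, check.stderr)
print(check.pvalue)  # two-sided p-value reported by linregress
```

If the manual and library values agree, any remaining difference in the p-value comes down to whether a one-sided or two-sided test is being reported.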
#creating the dataframe for happiness scores and coronavirus cases
happiness_covid = pd.DataFrame(happiness['Country'])
happiness_covid["Happiness Score"] = happiness["Score"]
cases = []
for country in happiness_covid["Country"]:
    row = covid_cases[covid_cases["Country,\nOther"]==country]
    cases.append(row["Tot Cases/\n1M pop"].to_string(index=False))
happiness_covid["Cases"] = cases
# keep only countries found in the case table; copy so later assignments are safe
happiness_covid_filtered = happiness_covid[happiness_covid["Cases"]!='Series([], )'].copy()
happiness_covid_filtered.head()
#scatterplot for coronavirus and happiness score
happiness_score = happiness_covid_filtered['Happiness Score']
# strip spaces and thousands separators so the case counts can be parsed as floats
cases_filtered = []
for y in happiness_covid_filtered["Cases"]:
    cases_filtered.append(y.replace(' ', '').replace(',', ''))
happiness_covid_filtered.loc[:, "Cases"] = cases_filtered
cases = happiness_covid_filtered['Cases'].astype("float")
plt.figure(figsize=(5, 5))
plt.ylabel("Coronavirus Cases (Total Cases/1M population)")
plt.xlabel("Happiness Score")
plt.title("Coronavirus Cases and Happiness Scores, by Country")
plt.scatter(happiness_score,cases)
plt.show()
Based on the scatterplot, there appears to be a positive correlation between happiness score and coronavirus cases, which means that as happiness score increases, coronavirus cases tend to increase as well.
#residual plot
# use a new name here so the covid_cases DataFrame loaded earlier is not overwritten
cases_float = happiness_covid_filtered["Cases"].astype("float")
plt.figure(figsize=(5, 5))
sns.residplot(x=happiness_score, y=cases_float, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Coronavirus Cases")
plt.show()
There appears to be some pattern to the residual plot, which suggests that a linear regression is not likely to be appropriate.
#additional values
#correlation
happy = np.array(happiness_covid_filtered["Happiness Score"])
case = np.array(happiness_covid_filtered["Cases"].astype(float))
correlation = np.corrcoef(happy, case)[0,1]
print("Correlation coefficient: ", correlation)
A correlation coefficient of 0.57 suggests that there is a relatively strong positive relationship between coronavirus cases and happiness scores.
In the original data, which has 143 data points, the scatterplot and the correlation coefficient (+0.57) indicate a positive correlation between happiness scores and coronavirus cases. In a wider context, this suggests that higher happiness scores are associated with more reported coronavirus cases. One possible explanation is that countries with higher happiness scores have more access to the resources needed to test for and report instances of COVID-19. This may explain why happier countries have many more reported cases than less happy countries.
#scatterplot
cases = happiness_covid_filtered["Cases"].astype("float")
cases = np.log(cases)
plt.figure(figsize=(5,5))
plt.scatter(happiness_covid_filtered["Happiness Score"], cases)
plt.xlabel("Happiness Score")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Happiness Score compared to the Log of Coronavirus Cases")
plt.show()
Based on the scatterplot, it appears that there is a relatively strong positive relationship between happiness score and the log of coronavirus cases and that a linear relationship better describes this data than the original.
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=happiness_score, y=cases, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Log of Coronavirus Cases")
plt.show()
The residual plot indicates little to no pattern, which suggests that a linear regression is appropriate.
#linear regression
logmodel = LinearRegression().fit(happiness_covid_filtered[["Happiness Score"]], cases)
print("Linear Regression for Happiness Score and Log of Coronavirus Cases: ")
print("Regression Slope: ", logmodel.coef_[0])
print("Regression Intercept: ", logmodel.intercept_)
print("Coefficient of determination: ", logmodel.score(happiness_covid_filtered[["Happiness Score"]], cases))
Using the model of y = mx+b, the regression line that predicts coronavirus cases (log) from happiness scores is: $$\hat{y} = 1.64*(x) - 4.37$$
with y as log of coronavirus cases and x as happiness score.
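Because the response is the natural log of cases (the notebook uses `np.log`), predictions from this line have to be exponentiated to get back to cases per million, and the slope acts as a multiplicative factor. The sketch below shows that back-transformation using the coefficients quoted above; the helper names are hypothetical.

```python
import math

# Coefficients copied from the fitted line above (log cases vs. happiness score).
SLOPE = 1.64
INTERCEPT = -4.37


def predicted_cases_per_million(happiness_score, slope=SLOPE, intercept=INTERCEPT):
    """Back-transform the log-scale prediction to cases per million."""
    return math.exp(slope * happiness_score + intercept)


def fold_change_per_point(slope=SLOPE):
    """Multiplicative change in predicted cases for a one-point increase in score."""
    return math.exp(slope)


print(predicted_cases_per_million(5.0))
print(fold_change_per_point())
```

Under this model, each additional happiness point multiplies the predicted cases per million by roughly five, rather than adding a fixed number of cases.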
#hypothesis test
m = logmodel.coef_[0]
b = logmodel.intercept_
pred = m*happy + b
se = standard_error(happy, cases, pred)
#t-statistic
t = m/se
df = len(happy)-2
p = 1-stats.t.cdf(t,df=df)
#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
H0: There is no relationship between happiness scores and the log of coronavirus cases
H1: There is a relationship between happiness scores and the log of coronavirus cases
Since the p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between happiness scores and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.
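The alternative hypothesis as stated is two-sided ("there is a relationship"), whereas a single tail of the t distribution gives a one-sided p-value. A minimal sketch of the two-sided version is shown below, assuming SciPy's `stats.t.sf` (the survival function); the t statistic is an illustrative number, not the notebook's output, and 141 degrees of freedom corresponds to 143 countries.

```python
from scipy import stats


def two_sided_p_value(t_statistic, degrees_of_freedom):
    """Two-sided p-value for a t statistic: P(|T| >= |t|)."""
    return 2.0 * stats.t.sf(abs(t_statistic), df=degrees_of_freedom)


# Illustrative numbers only (not taken from the notebook's output).
example_t = 2.5
example_df = 141
one_sided = stats.t.sf(example_t, df=example_df)
two_sided = two_sided_p_value(example_t, example_df)
print(one_sided, two_sided)
```

When the one-sided p-value is as small as the ones reported here, doubling it does not change the conclusion, but the two-sided value is the one that matches the hypotheses as written.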
#correlation
happy = np.array(happiness_covid_filtered["Happiness Score"])
correlation = np.corrcoef(happy, cases)[0,1]
print("Correlation Coefficient: ", correlation)
A correlation coefficient of 0.75 suggests that there is a strong positive relationship between the log of coronavirus cases and happiness scores. Compared to the correlation found in the original data without taking the log of coronavirus cases, this correlation coefficient suggests a stronger positive linear relationship.
With the original data, we also explored the relationship between happiness scores and the log of coronavirus cases. From the scatterplot and correlation coefficient (.75), it seems that there is a relatively strong positive relationship between the two variables. One explanation for this finding is that, in the original data (without the log), coronavirus cases grew roughly exponentially with happiness score, so each additional increase in happiness score came with a multiplicative increase in coronavirus cases. Taking the log of coronavirus cases and comparing that to happiness score transforms the relationship into a more linear one. When analyzing the data with the log of coronavirus cases, the coefficient of determination was .567, which means that nearly 57% of the variation in the log of coronavirus cases could be explained by the happiness scores of the countries. Additionally, the residual plot went from showing a pattern to showing no pattern, which suggests that a linear model can be used when comparing happiness scores with the log of coronavirus cases. From the hypothesis test on the linear regression, the p-value, essentially 0, was less than the significance level of 0.05, which suggests that the relationship we found between happiness score and coronavirus cases is statistically significant and not likely the consequence of random chance.
# combine healthcare and happiness dataframes
health_happiness = pd.DataFrame(healthcare['Country'])
health_happiness["Healthcare Ranking"] = healthcare["healthcareRank"]
scores = []
for country in health_happiness["Country"]:
    row = happiness[happiness["Country"]==country]
    scores.append(row["Score"].to_string(index=False))
health_happiness["Happiness Score"] = scores
health_happiness_filtered = health_happiness[health_happiness["Happiness Score"]!='Series([], )']
health_happiness_filtered.head()
# scatterplot for healthcare ranking and happiness score
healthcare_ranking = health_happiness_filtered['Healthcare Ranking']
happiness_scores = health_happiness_filtered['Happiness Score'].astype("float")
plt.figure(figsize=(5,5))
plt.ylabel("Happiness Score")
plt.xlabel("Healthcare Ranking")
plt.title("Happiness Score and Healthcare Ranking, by Country")
plt.scatter(healthcare_ranking,happiness_scores)
plt.yticks
plt.show()
Based on the scatterplot, it appears that there is a relatively weak negative relationship between healthcare rankings and happiness scores, meaning that as healthcare ranking increases, happiness scores decrease. This makes sense since larger healthcare rankings correspond to worse healthcare. It is likely that countries with worse healthcare also have lower qualities of life (happiness scores).
# residual plot
happy_array = np.array(health_happiness_filtered["Happiness Score"].astype("float"))
health_array = np.array(health_happiness_filtered["Healthcare Ranking"])
plt.figure(figsize=(5,5))
sns.residplot(x=health_array, y=happy_array, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare Ranking vs. Happiness Score")
plt.show()
The residual plot shows no pattern, demonstrating that a linear regression model is plausible in this case.
# create linear model + analyze summary stats
# separate names keep the original healthcare/happiness DataFrames available for later cells
health_X = health_happiness_filtered[["Healthcare Ranking"]].astype("float")
happy_y = health_happiness_filtered[["Happiness Score"]].astype("float")
model = LinearRegression().fit(health_X, happy_y)
print("Linear Regression for Healthcare Ranking and Happiness Score: ")
slope = model.coef_[0]
print("Regression Slope: ", model.coef_[0][0])
intercept = model.intercept_
print("Regression Intercept: ", model.intercept_[0])
# score() returns the coefficient of determination (R^2), not the correlation coefficient
r_squared = model.score(health_X, happy_y)
r = np.sign(model.coef_[0][0]) * np.sqrt(r_squared)
print("Correlation Coefficient: ", r)
print("Coefficient of Determination: ", r_squared)
Using the model of y = mx+b, the regression line that predicts happiness scores from healthcare rankings is: $$\hat{y} = -0.01995*(x) + 7.03103$$
with y as happiness score and x as healthcare ranking.
# hypothesis test for happiness score and healthcare ranking
# new names are used so the healthcare/happiness DataFrames are not overwritten
happy_vals = health_happiness_filtered["Happiness Score"].astype("float")
health_vals = health_happiness_filtered["Healthcare Ranking"].astype("float")
inter = model.intercept_[0]
slope = model.coef_[0][0]
# finding standard error
pred = slope*health_vals + inter
df = len(health_vals)-2
se = standard_error(health_vals, happy_vals, pred)
t = slope/se # t-statistic
p = stats.t.cdf(t,df=df) # p value given t statistic and degrees of freedom
print("t statistic:", t)
print("standard error:", se)
print("degrees of freedom:", df)
print("p value for significance level .05: ", float(p))
In the original data, which has 78 data points, the scatterplot and the negative regression slope indicate a negative correlation between the two variables (the reported value of .36 is a coefficient of determination, which carries no sign). This means that countries with higher healthcare rankings (meaning worse healthcare systems) have lower happiness scores. Because the residuals were random, we were able to create a linear model. The slope of the model was −0.01995 with an intercept of 7.03. This means that for every one-position increase in healthcare ranking, the predicted happiness score decreases by 0.01995. We performed a t-test with the data and found a p value of $2.65 \times 10^{-9}$, which is less than our significance level of .05. This provides strong evidence that the relationship between happiness scores and healthcare rankings is not due to random chance.
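Note that scikit-learn's `LinearRegression.score` returns the coefficient of determination $R^2$ rather than the correlation coefficient. For a regression with a single predictor, the correlation can be recovered as the square root of $R^2$ with the sign of the slope. The sketch below assumes that the .36 quoted above is such an $R^2$ value; `correlation_from_r_squared` is a hypothetical helper.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Signed Pearson r for a one-predictor linear regression."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must be between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# Assumption: 0.36 is the value printed by model.score; the slope comes from the fitted line.
r = correlation_from_r_squared(0.36, slope=-0.01995)
print(r)
```

Under that assumption the correlation would be negative and moderate in size, in line with the downward trend in the scatterplot.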
In this section, we are combining all three datasets into one dataframe. This will allow us to create a multiple linear regression, as well as look at the relationships between each pair of datasets with a consistent set of countries.
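Combining tables by country can also be written as a pair of inner joins instead of a lookup loop. The sketch below is one possible way to do it with `DataFrame.merge`, assuming the column names used in this notebook; `combine_by_country` and the toy tables are hypothetical and only show the shape of the call.

```python
import pandas as pd


def combine_by_country(healthcare_df, happiness_df, covid_df):
    """Inner-join the three tables on country name and clean the case counts."""
    # Column names follow the ones used in this notebook.
    health = healthcare_df[["Country", "healthcareRank"]].rename(
        columns={"healthcareRank": "Healthcare Ranking"})
    happy = happiness_df[["Country", "Score"]].rename(
        columns={"Score": "Happiness Score"})
    covid = covid_df[["Country,\nOther", "Tot Cases/\n1M pop"]].rename(
        columns={"Country,\nOther": "Country", "Tot Cases/\n1M pop": "Cases"})

    merged = health.merge(happy, on="Country", how="inner").merge(
        covid, on="Country", how="inner")
    # Strip thousands separators, then coerce anything unparseable to NaN.
    merged["Cases"] = pd.to_numeric(
        merged["Cases"].astype(str).str.replace(",", "", regex=False).str.strip(),
        errors="coerce")
    return merged.dropna(subset=["Cases"]).reset_index(drop=True)


# Tiny made-up tables, only to show the call.
toy_health = pd.DataFrame({"Country": ["A", "B", "C"], "healthcareRank": [1, 2, 3]})
toy_happy = pd.DataFrame({"Country": ["A", "B"], "Score": [7.5, 6.0]})
toy_covid = pd.DataFrame({"Country,\nOther": ["A", "B", "C"],
                          "Tot Cases/\n1M pop": ["1,234", "56", "789"]})
toy_combined = combine_by_country(toy_health, toy_happy, toy_covid)
print(toy_combined)
```

An inner join keeps only countries present in all three tables, which is the same set the string-based filtering is meant to select, and it leaves the numeric columns as numbers rather than strings.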
### multiple linear regression
combine = pd.DataFrame(healthcare["Country"])
combine.loc[:, "Healthcare Ranking"] = healthcare['healthcareRank']
cases = []
casesh = []
happiness_rank = []
for country in combine["Country"]:
    row = covid_cases[covid_cases["Country,\nOther"]==country]
    cases.append(row["Tot Cases/\n1M pop"].to_string(index=False))
    rowh = happiness[happiness["Country"]==country]
    casesh.append(rowh["Score"].to_string(index=False))
combine.loc[:, "Cases"] = cases
combine.loc[:, "Happiness Score"] = casesh
combine_filtered = combine[combine["Happiness Score"]!="Series([], )"]
combine_filtered_1 = combine_filtered[combine_filtered["Cases"]!="Series([], )"].copy()
# strip spaces and thousands separators from the string columns
cases_filtered = []
happiness_scores = []
for y in combine_filtered_1["Cases"]:
    cases_filtered.append(y.replace(' ', '').replace(',', ''))
for a in combine_filtered_1["Happiness Score"]:
    happiness_scores.append(a.replace(' ', ''))
combine_filtered_1.loc[:, "Cases"] = cases_filtered
combine_filtered_1.loc[:, "Happiness Score"] = happiness_scores
combine_filtered_1.head()
# scatterplot with all three variables
plt.figure(figsize=(5,5))
plt.ylabel("Coronavirus Cases")
plt.xlabel("Healthcare Ranking / Happiness Score")
plt.title("Coronavirus Cases vs. Healthcare Ranking and Happiness Score, by Country")
case_health = plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Cases"].astype("float"))
case_happy = plt.scatter(combine_filtered_1["Happiness Score"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.legend((case_health, case_happy),("Healthcare Ranking", "Happiness Score"))
plt.yticks
plt.show()
#multiple regression coefficients
model_combined = LinearRegression().fit(combine_filtered_1[["Healthcare Ranking", "Happiness Score"]].astype("float"), combine_filtered_1["Cases"].astype("float"))
print("Predicting Coronavirus cases, based on healthcare ranking and happiness score: ")
print("")
print("Regression slope for healthcare ranking: ", (model_combined.coef_)[0])
print("Regression slope for Happiness score: ", (model_combined.coef_)[1])
#scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Happiness Score"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.title("Coronavirus Cases and Happiness Scores")
plt.ylabel("Coronavirus Cases/1M population")
plt.xlabel("Happiness Score")
plt.show()
Based on the scatterplot, there appears to be a positive relationship between coronavirus cases and happiness scores.
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=combine_filtered_1["Happiness Score"].astype("float"), y=combine_filtered_1["Cases"].astype("float"), lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Coronavirus Cases")
plt.show()
Similar to the original data, there appears to be some pattern to the residuals, which indicates that a linear model may not be appropriate.
#additional values
#correlation coefficient
happy_array = np.array(combine_filtered_1["Happiness Score"].astype(float))
case_array = np.array(combine_filtered_1["Cases"].astype(float))
correlation = np.corrcoef(happy_array, case_array)[0,1]
print("Correlation Coefficient: ", correlation)
A correlation coefficient of 0.52 suggests that there is a relatively strong positive relationship between coronavirus cases and happiness scores, which reflects what was found in the original data.
For the joined data, the number of data points was reduced to 77; for this merged data (with all three variables present), the scatterplot and the correlation coefficient (+0.52) suggest similar findings to the original data. While there is a slight discrepancy in the values, coronavirus cases still tend to increase as happiness score increases.
#scatterplot
cases_comb = combine_filtered_1["Cases"].astype("float")
cases_comb = np.log(cases_comb)
happy_comb = combine_filtered_1["Happiness Score"].astype(float)
plt.figure(figsize=(5,5))
plt.scatter(happy_comb, cases_comb)
plt.xlabel("Happiness Score")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Happiness Score compared to the Log of Coronavirus Cases")
plt.show()
Based on the scatterplot, there appears to be a relatively strong positive relationship between happiness score and the log of coronavirus cases. Additionally, there appears to be more of a linear trend, as opposed to an exponential trend (before taking the log of COVID-cases).
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=happy_comb, y=cases_comb, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Happiness Score')
plt.title("Residuals for Happiness Score vs. Log of Coronavirus Cases")
plt.show()
There appears to be no pattern to the residual plot, which indicates that a linear model may be appropriate.
#linear regression
happy_resize = combine_filtered_1[["Happiness Score"]].astype(float)
log_model = LinearRegression().fit(happy_resize, cases_comb)
print("Linear Regression for Happiness Score and Log of Coronavirus Cases: ")
print("Regression Slope: ", log_model.coef_[0])
print("Regression Intercept: ", log_model.intercept_)
print("Coefficient of determination: ", log_model.score(happy_resize, cases_comb))
Using the model of y = mx+b, the regression line that predicts coronavirus cases (log) from happiness scores is: $$\hat{y} = 1.17*(x) - 1.30$$
with y as log of coronavirus cases and x as happiness score.
#hypothesis test
m = log_model.coef_[0]
b = log_model.intercept_
pred = m*happy_array + b
se = standard_error(happy_array, cases_comb, pred)
#t-statistic
t = m/se
df = len(happy_array)-2
p = 1-stats.t.cdf(t,df=df)
#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
H0: There is no relationship between happiness scores and the log of coronavirus cases
H1: There is a relationship between happiness scores and the log of coronavirus cases
Since the p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between happiness scores and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.
#additional values
#correlation coefficient
happy = np.array(combine_filtered_1["Happiness Score"].astype(float))
correlation = np.corrcoef(happy, cases_comb)[0,1]
print("Correlation Coefficient: ", correlation)
A correlation coefficient of 0.6 suggests a moderately strong relationship between happiness score and the log of coronavirus cases.
Similar to the original data (section 2b), taking the log of coronavirus cases and comparing that to the happiness scores yielded a stronger linear correlation (+.6). This suggests that, before taking the log, coronavirus cases increase roughly exponentially with happiness score. Compared to the original data, there was a smaller coefficient of determination (.355), which means that less of the variation in the log of coronavirus cases could be explained by the happiness-score model; however, taking the log of coronavirus cases still yielded a more linear relationship. Furthermore, in the hypothesis test, the p-value was once again less than the significance level of 0.05, which suggests that the relationship we found between happiness score and coronavirus cases is statistically significant and not likely the consequence of random chance. These findings echo those from the original data set, which suggests that little changed even when the size of the data set changed.
Regarding happiness score and coronavirus cases, the positive correlation between the two variables suggests that countries with higher happiness scores (with societal factors including quality of life, economic status of the country, life expectancy, social support, etc.) have more reported cases. Some explanatory factors may include the fact that countries that enjoy higher economic status have more resources to test coronavirus cases. Similarly, countries with more social support and higher quality of life may encourage more people to get tested and/or require people get tested because people are more exposed and made aware of the fact that the disease exists within the country, which may be why such countries have higher counts of coronavirus cases.
#scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Cases"].astype("float"))
plt.title("Coronavirus Cases and Healthcare ranking")
plt.ylabel("Coronavirus Cases/1M population")
plt.xlabel("Healthcare Ranking")
plt.show()
Based on the scatterplot, there is a negative relationship between healthcare ranking and coronavirus cases.
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=combine_filtered_1["Healthcare Ranking"].astype("float"), y=combine_filtered_1["Cases"].astype("float"), lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare vs. Coronavirus Cases")
plt.show()
Like the original data, the residuals show a non-random distribution, suggesting a linear model is not appropriate.
#correlation coefficient
health_array = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
case_array = np.array(combine_filtered_1["Cases"].astype(float))
correlation = np.corrcoef(health_array, case_array)[0,1]
print("Correlation Coefficient: ", correlation)
This correlation coefficient shows a moderately strong negative correlation, consistent with what was found in the original data.
#scatterplot
caseslog = np.log(combine_filtered_1["Cases"].astype("float"))
health = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
plt.figure(figsize=(5,5))
plt.scatter(health, caseslog)
plt.xlabel("healthcare ranking")
plt.ylabel("Cases/1M Population (Log)")
plt.title("Healthcare ranking compared to the Log of Coronavirus Cases")
plt.show()
This scatterplot shows a negative relationship, as with the original data set. It looks more linear than before the log was taken, when it looked more exponential.
#residual plot
plt.figure(figsize=(5, 5))
sns.residplot(x=health, y=caseslog, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('healthcare ranking')
plt.title("Residuals for Healthcare ranking vs. Log of Coronavirus Cases")
plt.show()
As with the original data, the residual plot is random once the log of coronavirus cases is taken. This means a linear model may be appropriate.
#linear regression
health2 = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
caseslog1 = np.array(np.log(combine_filtered_1["Cases"].astype("float")))
newhealth = combine_filtered_1[["Healthcare Ranking"]].astype(float)
log_model = LinearRegression().fit(newhealth, caseslog)
print("Linear Regression for Healthcare and Log of Coronavirus Cases: ")
print("Regression Slope: ", log_model.coef_[0])
print("Regression Intercept: ", log_model.intercept_)
print("Coefficient of determination: ", log_model.score(newhealth, caseslog))
correlation = np.corrcoef(health2, caseslog1)[0,1]
print("Correlation Coefficient: ", correlation)
health2 = np.array(combine_filtered_1["Healthcare Ranking"].astype(float))
caseslog1 = np.array(np.log(combine_filtered_1["Cases"].astype("float")))
#hypothesis test
m = log_model.coef_[0]
b = log_model.intercept_
pred = m*health2 + b
se = standard_error(health2, caseslog1, pred)
#t-statistic
t = m/se
df = len(health2)-2
# lower-tail (one-sided) p-value; a two-sided test would use 2*stats.t.sf(abs(t), df=df)
p = stats.t.cdf(t,df=df)
#printing the values
print("The t-statistic: ", t)
print("The p-value: ", p)
H0: There is no relationship between healthcare ranking and log of coronavirus cases
H1: There is a relationship between healthcare ranking and log of coronavirus cases
Since the p-value, which is essentially 0, is less than the significance level of 0.05, we reject the null hypothesis. This means that there is statistically significant evidence of a linear relationship between healthcare ranking and the log of coronavirus cases, and that our findings are unlikely to be due to random chance.
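The slope test used above can be sketched end to end as follows. This is a minimal illustration, not the notebook's own `standard_error` helper (which is defined elsewhere): `slope_standard_error` and `slope_t_test` are hypothetical names, the standard-error formula is the usual one for a simple linear regression slope, and the data are synthetic. The sketch reports a two-sided p-value, which is the version that matches an H1 of "there is a relationship".

```python
import numpy as np
from scipy import stats


def slope_standard_error(x, y, pred):
    """Standard error of the regression slope (hypothetical helper).

    SE = sqrt( sum((y - pred)^2) / (n - 2) ) / sqrt( sum((x - mean(x))^2) )
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    n = len(x)
    residual_var = np.sum((y - pred) ** 2) / (n - 2)
    return np.sqrt(residual_var / np.sum((x - x.mean()) ** 2))


def slope_t_test(x, y):
    """Fit y = m*x + b by least squares and test H0: m = 0 (two-sided)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, 1)
    se = slope_standard_error(x, y, m * x + b)
    t = m / se
    df = len(x) - 2
    # two-sided p-value: probability of a |t| at least this large under H0
    p_two_sided = 2 * stats.t.sf(abs(t), df=df)
    return m, se, t, p_two_sided


# Synthetic data for illustration only (not the project's data)
rng = np.random.default_rng(0)
x_demo = np.arange(1, 51, dtype=float)
y_demo = 10.0 - 0.04 * x_demo + rng.normal(0, 0.5, size=x_demo.size)

m_demo, se_demo, t_demo, p_demo = slope_t_test(x_demo, y_demo)
print("slope:", m_demo)
print("standard error:", se_demo)
print("t statistic:", t_demo)
print("two-sided p value:", p_demo)
```

For a negative slope, the lower-tail value `stats.t.cdf(t, df=df)` is half of this two-sided p-value, so the two approaches lead to the same decision whenever the p-value is far below the significance level.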
In comparing the original data with the joined data, some of the values changed because of the difference in the length and contents of the data sets, but the relationships established and the general trends of the data remained the same.
In both data sets, it appears that there is a moderate to strong negative relationship between healthcare ranking and coronavirus cases.
The correlation changed from -0.50 with the original data (section 2a) to -0.56 with just the joined data, which may suggest that some outliers were removed when the data was joined, making the correlation stronger. The residual plot for this data was once again not random, so we did not do a t-test on it. When we took the log of cases in the joined data, we found that the residual plot was more random, so we did a linear regression and hypothesis test on that. The slope changed from -0.0385 to -0.0387, which is a very small change. We found the p-value to be $1.1 \times 10^{-6}$, which once again suggests there is a linear relationship between the two variables. Overall, using the joined data did not change our findings significantly.
Overall, our findings are that countries with higher healthcare ranks (meaning a worse healthcare system) have fewer reported coronavirus cases. This could be because countries with worse healthcare systems have less ability to test people, so they are currently reporting fewer cases. Even in countries with lower ranks (better healthcare systems), the number of confirmed cases often falls significantly short of the actual number of cases. For example, in some places where tests are in short supply, people are only tested for the virus if the result would change their treatment; otherwise, their symptoms are simply treated as they present themselves. This is likely leading to an undercount of cases.
# scatterplot
plt.figure(figsize=(5,5))
plt.scatter(combine_filtered_1["Healthcare Ranking"].astype("float"),combine_filtered_1["Happiness Score"].astype("float"))
plt.title("Healthcare Ranking and Happiness Scores")
plt.ylabel("Happiness Score")
plt.xlabel("Healthcare Ranking")
plt.show()
This scatterplot is very similar to the one created with the original data set. There appears to be a moderate negative relationship between healthcare rankings and happiness scores.
# residual plot
# healthcare ranking is the predictor (x); happiness score is the response (y)
health_array = np.array(combine_filtered_1["Healthcare Ranking"].astype("float"))
happy_array = np.array(combine_filtered_1["Happiness Score"].astype("float"))
plt.figure(figsize=(5,5))
sns.residplot(x=health_array, y=happy_array, lowess=False, color="b")
plt.ylabel("Residuals")
plt.xlabel('Healthcare Ranking')
plt.title("Residuals for Healthcare Ranking vs. Happiness Score")
plt.show()
The residual plot shows no pattern, demonstrating that a linear regression model is plausible in this case.
#linear regression
happiness = combine_filtered_1[["Happiness Score"]].astype("float")
healthcare = combine_filtered_1[["Healthcare Ranking"]].astype("float")
happ_model = LinearRegression().fit(healthcare, happiness)
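# score() returns the coefficient of determination (R^2), not the correlation coefficient
r_squared = happ_model.score(healthcare, happiness)
# in simple linear regression, r is the square root of R^2 with the sign of the slope
r = np.sign(happ_model.coef_[0][0]) * np.sqrt(r_squared)
print("Linear Regression for Happiness Score and Healthcare Ranking: ")
print("Regression Slope:", happ_model.coef_[0][0])
print("Regression Intercept:", happ_model.intercept_[0])
print("Correlation coefficient:", r)
print("Coefficient of determination:", r_squared)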
Using the model $y = mx + b$, the regression line that predicts happiness scores from healthcare rankings is: $$\hat{y} = -0.01995x + 7.021$$
where $\hat{y}$ is the predicted happiness score and $x$ is the healthcare ranking.
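When reading regression output like the above, it helps to keep the correlation coefficient $r$ and the coefficient of determination $R^2$ apart. scikit-learn's `LinearRegression.score` reports $R^2$; for a regression with a single predictor, $R^2$ equals $r^2$, and $r$ carries the sign of the slope. The sketch below checks this with NumPy only, on synthetic data chosen to loosely resemble a ranking-versus-score relationship (the numbers are assumptions, not the project's data).

```python
import numpy as np

# Synthetic data for illustration only (not the project's data)
rng = np.random.default_rng(1)
x_demo = np.arange(1, 78, dtype=float)
y_demo = 7.0 - 0.02 * x_demo + rng.normal(0, 0.6, size=x_demo.size)

# least-squares line y = slope*x + intercept
slope, intercept = np.polyfit(x_demo, y_demo, 1)
pred = slope * x_demo + intercept

# coefficient of determination: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient
r = np.corrcoef(x_demo, y_demo)[0, 1]

print("slope:", slope)
print("correlation coefficient r:", r)
print("coefficient of determination R^2:", r_squared)
print("r squared:", r ** 2)
```

Squaring an $R^2$ value again would understate the share of variance explained, so it is worth confirming which quantity a function returns before labeling it.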
# hypothesis test for happiness score and healthcare ranking
happiness = combine_filtered_1["Happiness Score"].astype("float")
healthcare = combine_filtered_1["Healthcare Ranking"].astype("float")
inter = happ_model.intercept_[0]
slope = happ_model.coef_[0][0]
# finding standard error
pred = slope*healthcare + inter
df = len(healthcare)-2
se = standard_error(healthcare, happiness, pred)
t = slope/se # t-statistic
p = stats.t.cdf(t,df=df) # lower-tail (one-sided) p value given t statistic and degrees of freedom
print("t statistic:", t)
print("standard error:", se)
print("degrees of freedom:", df)
print("p value (compare with significance level .05): ", float(p))
The joined data had a length of 77, meaning that only one data point was removed from the original dataset. Therefore, the findings from the joined dataset were roughly the same as in the original (section 2c).
Overall, we see that countries with higher healthcare rankings (meaning a worse healthcare system) have lower happiness scores. This makes sense because a "good" healthcare system may improve the quality of life in a country, resulting in higher happiness scores in countries with better healthcare systems. Countries with higher healthcare rankings (worse healthcare systems) also seem to be smaller island countries. Five of the 10 countries with the worst healthcare ratings are small island countries, while only one of the top 10 countries falls into this category. It would be interesting to do further research to see if the size, location, and average temperature of countries have any correlation with healthcare rankings and happiness scores.
Overall, we saw that both healthcare rankings (government institutions) and happiness scores (societal factors) have a relatively strong relationship with the number of coronavirus cases per country. Healthcare rankings have a negative relationship with the number of coronavirus cases. Taken on its own, this is counterintuitive, because one would expect countries with better healthcare to have fewer coronavirus cases. Likewise, when we look at happiness scores and coronavirus cases, we see that countries with higher happiness scores have more coronavirus cases, which is also the opposite of what we might expect; once again, external factors may be at play. However, since we saw that there is a negative relationship between happiness scores and healthcare ranking (i.e. countries with better healthcare had higher happiness scores), it makes sense that both variables show the same trend with coronavirus cases per country.
In real-life terms, happiness scores and healthcare rankings are influenced by a variety of factors, including income, quality of life, and freedom. Such societal factors can also directly influence the number of coronavirus cases, as income and quality of life can determine how accessible and feasible social distancing is for people. Likewise, it makes sense that general healthcare rankings are correlated with coronavirus cases, as countries with better-ranked healthcare systems probably place more emphasis on treatment and guidelines for their citizens. Our analysis looks at the bigger picture, treating each country as a whole; however, countries have certain "hotspots", where people may be gathered and/or cases may be more frequently reported based on the resources that are available. The uneven distribution of wealth and access to healthcare across regions may explain disparities within countries that cannot be seen in country-level numbers. In more developed countries, where resources are more available, the numbers reported may be higher because there is more access to tests and to medical care. This means that our findings on healthcare and coronavirus cases may prove to be reversed by the end of the pandemic.
Since we did all of our data collection and analysis in the middle of the pandemic, the numbers we collected for coronavirus cases are not the final number of cases in each country. Although we found some preliminary evidence, in order to draw stronger conclusions, it would be useful to re-run the tests with the final numbers once the pandemic is over. Furthermore, due to the widespread lack of testing in some places, some countries may be underrepresented in the coronavirus cases data. Additionally, heteroscedasticity (unequal spread of the data across the range of a predictor) affects many of the comparisons in our dataset. In the instances where we took the log of coronavirus cases, the data had small variances concentrated in one area and large variances concentrated in another, and the relationships looked exponential rather than linear. Because of this, we analyzed the independent variables with respect to the log of the dependent variable to determine what relationships and correlations existed in the data.
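The effect of the log transform described above can be illustrated with a small sketch. The data here are synthetic and the parameter values are assumptions chosen only to mimic an exponential relationship with noise that scales with the level of the response; `fit_residuals` and `spread_ratio` are hypothetical helper names. The sketch fits a straight line to the raw and to the log-transformed response and compares how the residual spread differs between the lower and upper halves of the predictor.

```python
import numpy as np

# Synthetic data for illustration only (not the project's data):
# exponential decay with multiplicative noise, so the spread of y grows with its level
rng = np.random.default_rng(2)
x = np.linspace(1, 100, 200)
y = 5000.0 * np.exp(-0.04 * x) * np.exp(rng.normal(0, 0.4, size=x.size))


def fit_residuals(x, y):
    """Residuals from a least-squares straight-line fit of y on x."""
    m, b = np.polyfit(x, y, 1)
    return y - (m * x + b)


def spread_ratio(x, resid):
    """Residual standard deviation in the lower half of x divided by the upper half."""
    cutoff = np.median(x)
    return np.std(resid[x <= cutoff]) / np.std(resid[x > cutoff])


raw_ratio = spread_ratio(x, fit_residuals(x, y))
log_ratio = spread_ratio(x, fit_residuals(x, np.log(y)))

print("residual spread ratio, raw y:", raw_ratio)
print("residual spread ratio, log y:", log_ratio)
```

A ratio near 1 means the residual spread is similar across the range of the predictor, which is what a linear model assumes; a ratio far from 1 is a sign of heteroscedasticity. A residual plot remains the more informative check, since a single ratio cannot show curvature.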
With regard to the coronavirus data set, we tried to web scrape the table with all the cases from the site; however, this proved to be difficult because the number of cases scraped from the website did not align with the proper country due to the HTML formatting of the site. Therefore, we chose a different method of obtaining the data.
In further studies, we could look at cell phone data to see how well people are following social distancing rules and compare that with the number of cases in the country. Additionally, we could look for shared factors and/or confounding variables behind government institutions and societal factors, which could help determine whether one has more of an effect on the number of coronavirus cases. Other considerations include age distribution and population density for countries. The coronavirus information provided to the public suggests that people in higher age demographics are more susceptible to the illness and that countries with higher population densities have better-equipped healthcare systems. Such factors may also help explain the number of coronavirus cases for a specific country.
We would like to thank Professor Mimno for giving us feedback throughout the process and looking at different iterations of our project. Our TAs were also very helpful in answering questions that came up. In addition, https://stattrek.com/regression/slope-test.aspx was a helpful resource in figuring out how to run our linear regression hypothesis tests.