COVID-19 has dominated the past several months of our lives, and as vaccines arrive to help secure everyone's future, it is important to look back and analyze the rollercoaster ride it has been, particularly the case counts, not only in North America but throughout the world over the course of the pandemic so far. We also investigate other factors that influence the differences in COVID-19 cases from place to place. Delving into the specifics of this data is crucial at a time when a vaccine is being used to help prevent the spread of the virus, and it also teaches us more about the virus itself and what can help stop it.
What is our end goal? We need to work together to bring the number of cases to zero, and the only way to make this pivotal change is to understand where the majority of cases are coming from, not only at the current moment but over time, and how to prevent spikes in cases. Strict social distancing guidelines must remain in force. We predict that earlier lockdown start dates as well as longer lockdown durations allowed for lower total case counts. We also believe that higher poverty rates and population densities are key factors behind higher positivity rates. There are many other factors we consider that could be affecting the spread of the virus.
In our analysis, we will visualize the data on COVID-19 cases worldwide, then focus specifically on North America, where there are many cases. We will look not only at total cases but also at the number of new cases over time. Here, we use several sklearn regression models to see which prediction method is the most accurate. After that, we analyze the different factors that could be affecting the number of cases, in an attempt to help reduce the number of cases in North America.
!pip install folium
!pip install plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn import linear_model as lm
from sklearn import ensemble
from sklearn import svm
from sklearn import tree
from sklearn import neighbors
import folium
from folium import plugins as plugins
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
To analyze the data over time for North America, we will scrape the data daily beginning April 12th. Since the data covers every state (including U.S. territories), we will fetch it per day, concatenate it for each month, and save each month into a separate dataframe. The data is sourced from GitHub and contains COVID-19 cases compiled by a reputable source, the Center for Systems Science and Engineering at Johns Hopkins University: https://github.com/CSSEGISandData/COVID-19 where even more data can be found. Many other sources have used this data to provide excellent interactive views of cases throughout the world, such as: https://ourworldindata.org/coronavirus. We will scrape the latest data that has been documented so far.
This is how we scrape the data for North America (fetching the daily COVID data by state/territory and saving one dataframe per month):
# Build one dataframe per month from the JHU CSSE daily US reports
# (the data only begins from April 12th onwards; our scrape ends on December 19th)
base_url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports_us/')

def scrape_month(month, first_day, last_day):
    # one URL per daily report, e.g. .../04-12-2020.csv
    urls = [base_url + f'{month:02d}-{day:02d}-2020.csv'
            for day in range(first_day, last_day + 1)]
    return pd.concat([pd.read_csv(url, sep=',', error_bad_lines=False) for url in urls])

data_america_april = scrape_month(4, 12, 30)
data_america_may = scrape_month(5, 1, 31)
data_america_june = scrape_month(6, 1, 30)
data_america_july = scrape_month(7, 1, 31)
data_america_aug = scrape_month(8, 1, 31)
data_america_sept = scrape_month(9, 1, 30)
data_america_oct = scrape_month(10, 1, 31)
data_america_nov = scrape_month(11, 1, 30)
data_america_dec = scrape_month(12, 1, 19)  # scraped through December 19th
data_america_april
Now, we will grab data for worldwide COVID-19 cases. This is read from a tsv file based on https://github.com/owid/covid-19-data/tree/master/public/data, a dataset maintained by Our World in Data and updated daily. We make minor modifications to the data, such as converting the date column into a datetime type and adding a column for the month. This is how we read the worldwide data:
read_tsv = 'worldwide-covid-data.tsv'
data_worldwide = pd.read_csv(read_tsv, sep='\t', error_bad_lines=False)
data_worldwide['datetime'] = pd.to_datetime(data_worldwide['date'])
data_worldwide['month'] = pd.DatetimeIndex(data_worldwide['datetime']).month
data_worldwide
This is how we take the COVID-19 data with total cases around the world by location. Here, we use data from https://coronavirus.jhu.edu/map.html, which also provides a dashboard of COVID-19 data throughout the world with the most up-to-date information. We keep only the rows for the most recent date, giving the total cases up to that point for every location. This data will be used to correlate different factors, such as lockdown dates, with the total cases by location. We import a csv file and read in the data as follows:
read_tsv = 'covid-all-data.csv'
covid_all_data = pd.read_csv(read_tsv, sep=',', error_bad_lines=False)
covid_all_data = covid_all_data[covid_all_data['date'] == '2020-12-18']
covid_all_data.drop(covid_all_data.columns.difference(['location', 'total_cases']), axis=1, inplace=True)
covid_all_data.rename(columns={'location': 'Country'}, inplace=True)
covid_all_data.reset_index(drop=True, inplace=True)
covid_all_data
Here, we will look at the lockdown dates (specifically the start and end dates) by location, read in from a tsv file. We modify the data slightly by dropping a column we don't need and converting the date columns into datetime objects. This data comes from https://auravision.ai/covid19-lockdown-tracker/, which is also a great resource for looking at lockdowns throughout the world and different ways to visualize them. This is how we read this data:
read_tsv = 'lockdown_dates.tsv'
data_worldwide_lockdown = pd.read_csv(read_tsv, sep='\t', error_bad_lines=False)
data_worldwide_lockdown = data_worldwide_lockdown.drop(columns=['url'])
data_worldwide_lockdown['Start date'] = pd.to_datetime(data_worldwide_lockdown['Start date'])
data_worldwide_lockdown['End date'] = pd.to_datetime(data_worldwide_lockdown['End date'])
data_worldwide_lockdown
At this point, we merge our two dataframes, joining the total cases by location with the start and end lockdown dates using pandas.
combined_data = pd.merge(data_worldwide_lockdown, covid_all_data, on='Country')
combined_data
Now that we have our data, we will first look at it on a large scale. Let's compare the average number of cases per country within each continent. To do this, we can find the average number of cases for each continent from the worldwide data using the groupby function. This gives us a table that we can then plot to see visually how each continent compares to the rest.
#so here we find the average of the total cases in each continent
data_worldwide_avg = data_worldwide.groupby(data_worldwide['continent']).mean().reset_index()
#but what does it look like? let's plot it.
plt.plot(data_worldwide_avg['continent'], data_worldwide_avg['total_cases'])
plt.xlabel('continent')
plt.ylabel('# of total cases')
plt.title('average total case across continents')
Using this, we can compare the current total case counts for each continent. From this we can see that Africa and Oceania both have significantly lower levels of total cases per country compared to the other continents. We can also see that North America and South America have higher total case counts, probably due to a higher amount of interaction among individuals who tested positive.
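To make those gaps explicit, we can also print the same averages as a sorted table (a small supplementary sketch using the data_worldwide_avg frame computed above):
#the same comparison as a table, sorted from fewest to most average total cases
data_worldwide_avg[['continent', 'total_cases']].sort_values('total_cases')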
Next, we will also look at the rate at which the number of cases grew in each continent. We will do this by plotting time on the x-axis in months. To achieve this, we need to separate each continent from the worldwide data and find the means of each per month. We can then plot this information.
#splitting the data by continent
df = data_worldwide.groupby(by = data_worldwide['continent'], as_index=False)
#setting the months as numerical values such that January corresponds to 1 and December to 12
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
#for each continent we find the mean per month and plot it
for continent in ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']:
    continent_means = df.get_group(continent).groupby(by = ['month']).mean()
    continent_means['month'] = months
    plt.plot(continent_means['month'], continent_means['total_cases'], label = continent)
plt.xlabel('month')
plt.ylabel('# of total cases')
plt.title('average total case across time')
plt.legend()
From this line graph, we can see how the continents compare over time with respect to total cases. Over time, neither Africa nor Oceania increases drastically, which matches our earlier observation that these two continents have the lowest average total cases; the reverse holds for North America and South America. This graph is also interesting because we can see when cases started to pick up in certain places. For example, cases started to increase around May-June for the Americas, while for Europe, we see a drastic increase after October.
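We can also back up this reading numerically (a rough sketch: for each continent, the diff of the monthly means tells us which month saw the biggest jump in average total cases):
#monthly mean total cases per continent, then the month with the biggest jump
monthly = data_worldwide.groupby(['continent', 'month'])['total_cases'].mean()
jumps = monthly.groupby(level='continent').diff()
print(jumps.groupby(level='continent').idxmax())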
Next, we will visualize total cases throughout the world, which is best done with a choropleth map: an easy way to look at the data worldwide and see the number of COVID-19 cases per country. We add a time slider as well, so that the daily change in COVID-19 cases throughout the world can also be visualized.
#Legend updates as the number of cases change
fig = px.choropleth(data_frame = data_worldwide,
                    locations = "iso_code",
                    color = "total_cases",
                    hover_name = "location",
                    color_continuous_scale = 'sunset',
                    animation_frame = "date")
fig.show()
It is clear that though China starts off with the highest number of cases, over time it quickly falls behind the rest, possibly because the country was in lockdown for a while. Next, the United States, Brazil, and India slowly begin to increase, and then around April cases begin to rise quite rapidly everywhere and stay quite high. The United States definitely seems to be on the higher end of the spectrum.
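To check this reading of the map, we can plot those four countries' totals directly (a quick sketch; it assumes the OWID location names 'China', 'United States', 'Brazil', and 'India'):
#total cases over time for the four countries called out above
subset = data_worldwide[data_worldwide['location'].isin(['China', 'United States', 'Brazil', 'India'])]
px.line(subset, x='date', y='total_cases', color='location')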
So now we have seen how the continents compare. Let's get more specific and look at a smaller (yet still big) sample: North America. To begin with, we will look at North America's timeline and run regression models on it to see which fits best.
#extracting data for north america from the worldwide data
df = data_worldwide.groupby(by = data_worldwide['continent'], as_index=False)
North_America = df.get_group('North America')
#so let's plot the average monthly total cases
North_America_means = North_America.groupby(by = ['month']).mean()
North_America_means['month'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
plt.scatter(North_America_means['month'], North_America_means['total_cases'])
#now, using this, we can find a regression curve to predict.
#Except how do we know which regression to use?
#let's try a couple:
#first we define our variables, x as the month, y as the total cases
x = np.array(North_America_means['month'])
x = np.reshape(x, (x.size, 1))
y = np.array(North_America_means['total_cases'])
#we fit each candidate model and plot its predictions against the same months,
#so the curves line up with the scattered data points
models = {
    'Random Forest': ensemble.RandomForestRegressor(),
    'Linear SVM': svm.LinearSVR(),
    'Linear': lm.LinearRegression(),
    'Tree': tree.DecisionTreeRegressor(),
}
for name, reg in models.items():
    reg.fit(x, y)
    plt.plot(x, reg.predict(x), label = name)
#now we can label the plot and see which is the best predictor
plt.xlabel('month')
plt.ylabel('# of total cases')
plt.title('average total case across time in North America')
plt.legend()
So, now we have some regression models for total cases over time in North America. For Linear SVM, we can clearly see that the prediction does not come close to matching the actual scattered data points, as it remains at zero throughout the year. The Linear model starts and ends at the actual data, but in the middle months it does not accurately represent it. That leaves Tree and Random Forest, which both seem to be accurate representations. However, the line for Random Forest still follows the actual data more closely than the Decision Tree regression. Thus, we can see that the Random Forest model best predicts our data.
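As a rough numeric check on this visual comparison, we can score each fitted model on the same twelve monthly points (an in-sample sketch only; with just twelve points there is no held-out data, and tree-based models can essentially memorize the training points, so these numbers favor them):
from sklearn.metrics import mean_squared_error
#in-sample fit quality for each model: higher R^2 and lower MSE mean a closer fit
for name, model in models.items():
    print(f'{name}: R^2 = {model.score(x, y):.3f}, MSE = {mean_squared_error(y, model.predict(x)):.3e}')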
Another thing we can do in our analysis is compare cases at the state level. We already have data on North America, and visually, we will create a map representing the total cases in each state over time.
#drops duplicates, but keeps the last occurrence in the table, which represents the latest update for that month
data_amer_april = data_america_april.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_may = data_america_may.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_june = data_america_june.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_july = data_america_july.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_aug = data_america_aug.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_sept = data_america_sept.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_oct = data_america_oct.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_nov = data_america_nov.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_dec = data_america_dec.drop_duplicates(subset = ['Province_State'],keep='last')
data_amer_april
#a map for total cases by month
monthly_map1 = folium.Map(location=[48, -102], zoom_start=3)
#one colored layer per month; for each state we add a circle whose radius is the total cases
month_frames = [('april', data_amer_april, 'orange'),
                ('may', data_amer_may, 'yellow'),
                ('june', data_amer_june, 'white'),
                ('july', data_amer_july, 'purple'),
                ('aug', data_amer_aug, 'pink'),
                ('sept', data_amer_sept, 'gray'),
                ('oct', data_amer_oct, 'crimson'),
                ('nov', data_amer_nov, '#3186cc'),
                ('dec', data_amer_dec, 'green')]
for name, frame, color in month_frames:
    layer = folium.map.FeatureGroup(name=name).add_to(monthly_map1)
    for index, row in frame.iterrows():
        if pd.notna(row['Long_']) and pd.notna(row['Lat']):
            folium.Circle(
                location = [row['Lat'], row['Long_']],
                radius = row['Confirmed'],
                popup = row['Confirmed'],
                fill = True,
                color = color,
                fill_opacity = 0.05
            ).add_to(layer)
folium.LayerControl().add_to(monthly_map1)
monthly_map1
As time goes on, we see that the radius of each state's circle gets significantly bigger. This makes sense because each month adds on to the previous month's total, so the circles keep growing. One important thing to note is the drastic increase in each state's total cases from April to December. However, if we look at the increases month by month, we can see a sharp jump from October to November rather than the gradual increases seen before. This suggests a peak during this time, which we can also see in the graphs above, where North America's curve is steeper during this period.
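To put numbers behind that month-over-month reading, here is a quick sketch summing each month's end-of-month confirmed counts across all states (using the deduplicated per-month frames built above); the increase column shows the jump between consecutive months:
#national totals at the end of each month, and the change from the previous month
frames = {'apr': data_amer_april, 'may': data_amer_may, 'june': data_amer_june,
          'july': data_amer_july, 'aug': data_amer_aug, 'sept': data_amer_sept,
          'oct': data_amer_oct, 'nov': data_amer_nov, 'dec': data_amer_dec}
totals = pd.Series({name: frame['Confirmed'].sum() for name, frame in frames.items()})
print(pd.DataFrame({'total': totals, 'increase': totals.diff()}))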
Looking at the above map, we see total cases increase substantially, which makes sense since we are adding the previous months' cases to the current ones. However, we will also look at the new cases statewide. Only then can we analyze whether cases are actually decreasing over time or whether the trend is continuing.
#a map for new cases by month
monthly_map2 = folium.Map(location=[48, -102], zoom_start=3)
#we can find the new cases by taking this month's total cases and subtracting
#last month's total cases from it
data_amer_april['new cases'] = data_amer_april['Confirmed']
data_amer_may['new cases'] = data_amer_may['Confirmed'] - data_amer_april['Confirmed']
data_amer_june['new cases'] = data_amer_june['Confirmed'] - data_amer_may['Confirmed']
data_amer_july['new cases'] = data_amer_july['Confirmed'] - data_amer_june['Confirmed']
data_amer_aug['new cases'] = data_amer_aug['Confirmed'] - data_amer_july['Confirmed']
data_amer_sept['new cases'] = data_amer_sept['Confirmed'] - data_amer_aug['Confirmed']
data_amer_oct['new cases'] = data_amer_oct['Confirmed'] - data_amer_sept['Confirmed']
data_amer_nov['new cases'] = data_amer_nov['Confirmed'] - data_amer_oct['Confirmed']
data_amer_dec['new cases'] = data_amer_dec['Confirmed'] - data_amer_nov['Confirmed']
data_amer_dec
#for each month and state, we add a marker to the map with the radius being the new cases
#(we reuse the per-month frames and layer colors defined for the first map)
for name, frame, color in month_frames:
    layer = folium.map.FeatureGroup(name=name).add_to(monthly_map2)
    for index, row in frame.iterrows():
        if pd.notna(row['Long_']) and pd.notna(row['Lat']):
            folium.Circle(
                location = [row['Lat'], row['Long_']],
                radius = row['new cases'],
                popup = row['new cases'],
                fill = True,
                color = color,
                fill_opacity = 0.05
            ).add_to(layer)
folium.LayerControl().add_to(monthly_map2)
monthly_map2
We can see that the circles and their radii are much smaller than on the first monthly map, because we are looking only at the number of new cases each month. In April, New York has a big circle, with 304372 cases (we can click on a circle to see how many cases it represents). Then in May, we see 66398 new cases, which is much smaller: in April, we were not prepared for this pandemic, but by May we had started to fight it with emergency measures. The much lower May numbers suggest that what we did in April to curb cases largely worked. However, in November, the big circles in many states indicate a resurgence of the virus. One reason might be that many quarantine measures were lifted after the relatively low new-case counts from May to October, causing cases to climb again. In addition, it was Thanksgiving, which generally gathers many people together. In December, the new cases appear significantly smaller than in any other month. One reason is that it is currently December, so this month's data covers roughly a third less time than the others. Another could be that the big November peak prompted more safety measures. Hopefully December's statistics stay small, as they only show the numbers from before the upcoming holiday season. From this map, we can conclude that even when case counts are low, we cannot be carefree, as the numbers can jump right back up.
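To see the November resurgence in numbers rather than circle sizes, a small sketch listing the states with the most new cases that month:
#the ten states with the most new cases in November
data_amer_nov.nlargest(10, 'new cases')[['Province_State', 'new cases']]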
Okay, so we have seen the number of cases worldwide and, more specifically, in North America. We know that many factors can come into play with COVID-19, so we will now look at some of them, starting with location. From the first few line graphs, we saw that different continents have differing total case counts, so perhaps location is a factor in COVID-19 cases. We can check this by creating interaction terms, one for each continent, and seeing how they affect the number of cases. We will also graph the actual case data against the data predicted from the interaction terms to compare them more easily.
#adding dummies for continents
terms = pd.get_dummies(data_worldwide, columns = ['continent'])
terms['term_Africa'] = terms['continent_Africa']*terms['month']
terms['term_Asia'] = terms['continent_Asia']*terms['month']
terms['term_Europe'] = terms['continent_Europe']*terms['month']
terms['term_North America'] = terms['continent_North America']*terms['month']
terms['term_Oceania'] = terms['continent_Oceania']*terms['month']
terms['term_South America'] = terms['continent_South America']*terms['month']
terms
#picking what columns from terms we want in our fit
terms = terms[terms['total_cases'].notna()]
term = ['month', 'term_Africa', 'term_North America','term_South America','term_Asia',
'term_Europe', 'term_Oceania', 'continent_Africa', 'continent_North America',
'continent_South America','continent_Asia', 'continent_Europe', 'continent_Oceania']
#setting our variables for the regression
x = terms[term]
y = terms['total_cases']
#creating our regression
reg = lm.LinearRegression()
reg.fit(x, y)
#getting our prediction with interaction terms
prediction = reg.predict(terms[term])
terms['pred'] = prediction
coef = reg.coef_
intercept = reg.intercept_
#printing out our resulting equation with our coefficients
print(f'total cases = {intercept} + {coef[0]}{term[0]} + \
{coef[1]}{term[1]} + {coef[2]}{term[2]} + {coef[3]}{term[3]} + \
{coef[4]}{term[4]} + {coef[5]}{term[5]} + {coef[6]}{term[6]} + \
{coef[7]}{term[7]} + {coef[8]}{term[8]} + {coef[9]}{term[9]} + \
{coef[10]}{term[10]} + {coef[11]}{term[11]} + {coef[12]}{term[12]}')
#now we can plot actual vs predicted average total cases
total_cases = terms.groupby(['month'])['total_cases'].mean()
pred_cases = terms.groupby(['month'])['pred'].mean()
plt.plot( total_cases, label = 'actual')
plt.plot(pred_cases, label = 'pred')
plt.xlabel('month')
plt.ylabel('# total')
plt.title('actual vs predicted based on interaction terms')
plt.legend()
In the above equation, the dummy variables start with "continent" and the interaction terms start with "term". The equation tells us that the continent does make a drastic difference in the total number of cases. For example, earlier we saw that North America has many more cases than Oceania, and we can see this difference in the coefficients: 9977439.729240108continent_North America and 10253948.31146027continent_Oceania. In addition, looking at the graph for this interaction-term model, the actual and predicted data are fairly similar, which shows that the continents do have an effect on the total number of cases. The two curves also intersect at two points, meaning that for the months of April and October the prediction matched the actual averages exactly.
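To read a single continent's fitted trend out of this equation, we can build the dummy and interaction values by hand and predict (a sketch; predict_continent is a hypothetical helper defined here, and the column order must match the term list used in the fit):
#predict the average total cases for one continent at a given month
def predict_continent(continent, month):
    row = {t: 0 for t in term}
    row['month'] = month
    row['continent_' + continent] = 1
    row['term_' + continent] = month
    return reg.predict(pd.DataFrame([row])[term])[0]

for m in [4, 8, 12]:
    print('North America, month', m, ':', predict_continent('North America', m))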
Another factor we can look at is the lockdown period. We will plot each area's lockdown start and end dates along with its number of cases to see how quarantine relates to COVID-19 cases.
#We only want one of each area of the lockdown dates and totals, so we
#drop duplicates
combined_data = combined_data.drop_duplicates(subset = ['Country'])
combined_data
#setting the figure size and choosing dot colors
plt.figure(figsize=(10,10))
colors = np.random.rand(len(combined_data))
#scattering the data points based on start and end lockdown dates, with marker size proportional to total cases (divided by 5000)
plt.scatter(combined_data['Start date'], combined_data['End date'], s=(combined_data['total_cases'])/5000, c=colors)
plt.xlabel('start date')
plt.xticks(rotation=90)
plt.ylabel('end date')
plt.title('total cases with respect to lockdown start and end dates')
Based on this graph, and contrary to our hypothesis, there is no direct correlation between starting lockdown earlier (or locking down for longer) and having fewer total cases. The dots of differing sizes, which show each area's case count relative to the others, follow no obvious trend. Many areas start lockdown around mid-March to April and end anywhere from April to July, and they are all relatively close in size. There is one outlier in this timeframe that is much larger than the rest; since this outlier falls in the same time frame as the other areas, we see that when lockdown starts and ends does not have a clear effect on the total number of cases.
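We can quantify that visual impression with two simple correlations: lockdown duration versus total cases, and lockdown start date versus total cases (a sketch that drops areas with a missing date; values near zero would be consistent with the "no obvious trend" reading above).
#correlate lockdown duration and start date with total cases
ld = combined_data.dropna(subset=['Start date', 'End date', 'total_cases']).copy()
ld['duration'] = (ld['End date'] - ld['Start date']).dt.days
ld['start_ord'] = ld['Start date'].map(pd.Timestamp.toordinal)
print('duration vs total cases:', ld['duration'].corr(ld['total_cases']))
print('start date vs total cases:', ld['start_ord'].corr(ld['total_cases']))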
Here, we will discuss different factors, such as population density, older individuals (70 or older), extreme poverty, cardiovascular death rate, smoking, and handwashing facilities, and their correlation with total and new cases as well as deaths. This lets us see which factors appear strongly correlated and which do not. Without this, it is easy to assume correlations that may not exist, so having data to check them is definitely beneficial.
#Have the x-values with the heatmap dataframe
heatmap_x = ['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'new_tests', 'positive_rate', 'tests_units']
data_worldwide_x = data_worldwide[heatmap_x]
data_worldwide_x
#Have the y-values with the heatmap dataframe
heatmap_y = ['population_density', 'aged_70_older', 'extreme_poverty', 'cardiovasc_death_rate', 'handwashing_facilities']
data_worldwide_y = data_worldwide[heatmap_y]
data_worldwide_y['smokers'] = data_worldwide['female_smokers'] + data_worldwide['male_smokers']
data_worldwide_y
#Correlation map: each factor (rows) against each case/testing metric (columns)
#tests_units is a text column, so it drops out of the numeric correlation
corr = pd.concat([data_worldwide_y, data_worldwide_x], axis=1).corr()
cross_corr = corr.loc[data_worldwide_y.columns, corr.columns.intersection(data_worldwide_x.columns)]
sns.heatmap(cross_corr, xticklabels=cross_corr.columns, yticklabels=cross_corr.index)
From this, it is clear that population density is highly correlated with the numbers of cases and deaths, but not with the positivity rate. This makes sense, as we know COVID-19 spreads easily through contact between people. Likewise, older individuals (70 or older), extreme poverty, cardiovascular death rate, and handwashing facilities show high correlation with everything but the positivity rate. This also makes sense, as we know COVID-19 is less prevalent among younger and healthier people. Areas with many smokers do not correlate strongly with case counts, yet they do correlate with positivity rates, which is very interesting. Overall, this shows that more focus needs to go to areas with high population density, extreme poverty, and older individuals, especially with the vaccine and preventing spikes.
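For reference, we can also rank the individual factor/outcome pairs by the strength of their correlation (a small sketch using the cross_corr frame from the heatmap above; it assumes a pandas version where Series.sort_values accepts a key):
#the strongest factor/outcome correlations, largest magnitude first
cross_corr.stack().sort_values(key=abs, ascending=False).head(10)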
After looking in more detail at the specifics of COVID-19 cases, it is clear that America had many COVID-19 cases; moreover, worldwide, lockdown start dates do not seem to have a direct correlation with total case counts as we initially predicted. However, another of our initial predictions held up: population density, poverty, older individuals, and handwashing facilities all showed a high correlation with total cases.
What really stands out from the worldwide map of COVID-19 cases over time is that China initially had a high number of cases but quickly brought them under control, while America's counts rose and consistently stayed high, dipped, and then spiked up again recently. Our goal with this analysis is to figure out how to lower the number of COVID-19 cases and eventually bring it to zero, so what did China do to lower their cases, and what can America do to implement something similar? A lockdown definitely seems to be a valuable option; although an earlier lockdown start did not show a strong correlation with lower total cases, it may still improve the situation with new cases.
Analyzing COVID-19 data, first throughout the world by continent and then over time, gave us a great visualization for really understanding what was happening, when certain places had spikes, and why. This can also help gauge when and where it would be most important to release the vaccine. Furthermore, looking at America month by month and visualizing the new cases showed why spikes occur: it is clear that cases spike during holiday seasons when family and friends visit (like November in the new-cases map). Many factors cause spikes, and though there is a vaccine, it may not reach everyone soon enough; before more lives are lost, it is important to understand the pattern and see where we can improve. Happy holidays, everyone!