Analysis of Covid and related policy data in the United States

Shangyu Lai

1 Introduction

It is the end of 2021, but unfortunately the COVID-19 pandemic has not yet ended. Even worse, the virus is still mutating and spreading. In particular, the Omicron variant has been discovered, and it may spread faster than earlier variants. The vaccines appear to be less effective against Omicron. In this situation, personal effort is not enough to prevent the virus from spreading, and government action is needed to control the pandemic. In this tutorial, I analyze government policy data together with COVID-19 data and try to answer the following questions:

  1. How can the stringency of the policies be defined?
  2. How is the stringency of the policies related to COVID-19?
  3. How can the data be visualized to show the relationship between the policies and COVID-19?

2 Data collection and description

I collected all the data from the GitHub repositories and links listed below. All the data I used was last updated on Dec 11, 2021.

2.1 Vaccinations data:

The vaccination data comes from the CDC dataset "COVID-19 Vaccinations in the United States, County". It gives the number of people in each county who have been vaccinated against COVID-19. The data includes the vaccination date, the recipient's state and county, the number of people who have been fully vaccinated, and so on. Dates are in MM/DD/YYYY format.

2.2 Policy and covid data:

The policy data is from covid-policy-tracker. Some basic information about COVID-19, such as total cases and deaths, is embedded in this dataset (the Covid data comes from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University). The dataset defines the stringency index as $SI=\frac{1}{9}\left(\sum\limits_{j=1}^{8}{C_j}+H1\right)$. The meaning of each indicator, from $C1$ to $C8$ and $H1$, is as follows:

| ID | Description |
|---|---|
| $C1$ | Record closings of schools and universities |
| $C2$ | Record closings of workplaces |
| $C3$ | Record cancelling public events |
| $C4$ | Record limits on gatherings |
| $C5$ | Record closing of public transport |
| $C6$ | Record orders to "shelter-in-place" and otherwise confine to the home |
| $C7$ | Record restrictions on internal movement between cities/regions |
| $C8$ | Record restrictions on international travel |
| $H1$ | Record presence of public information campaigns |

For more details about the data, please refer to the codebook.
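As a minimal sketch of the stringency formula above, the snippet below averages nine indicator scores. The function name `stringency_index` and the sample values are hypothetical, and the sketch assumes that each indicator has already been rescaled to a common 0 to 100 range; that rescaling is not shown here.

```python
from typing import Mapping

# The nine indicators that enter the index, as listed in the table above.
INDICATORS = ("C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "H1")


def stringency_index(scores: Mapping[str, float]) -> float:
    """Average the nine indicator scores: SI = (C1 + ... + C8 + H1) / 9."""
    missing = [k for k in INDICATORS if k not in scores]
    if missing:
        raise KeyError(f"missing indicators: {missing}")
    return sum(scores[k] for k in INDICATORS) / len(INDICATORS)


# Hypothetical day: every indicator at 50 except H1 at 100
# (all values assumed to be on a 0-100 scale).
example = {k: 50.0 for k in INDICATORS}
example["H1"] = 100.0
si = stringency_index(example)
print(round(si, 2))
```

Because every indicator has the same weight, a change in any single policy moves the index by one ninth of that indicator's change.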

2.3 Timeline:

I got the timeline information from the CDC. The timeline runs from Dec 12, 2019 to Apr 23, 2021. With this data, I can get a rough picture of what happened during the pandemic, which may help in understanding how Covid and the policies affected each other.

3 Data Processing and Visualization

As a first step, I want to visualize the data to show the relationship between the policies and COVID-19. This should give rough answers to the following questions:

  1. How did Covid-19 progress in the US?
  2. How did the stringency change over time?
  3. Is there a relationship between the policies and the spread of Covid-19?
  4. Is there a relationship between the policies and the death rate?
  5. Is there a relationship between the policies and the number of cases?
  6. Is there a relationship between the policies and the number of vaccinations?

To keep the processing in a single dataframe, I merge the vaccination data with the policy data.
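A minimal sketch of such a merge with pandas is shown below. The tiny inline tables and all column names (`Recip_State`, `Series_Complete_Yes`, `State`, `StringencyIndex`, `ConfirmedCases`) are assumptions made for illustration; the real files may use different names and date formats, apart from the MM/DD/YYYY format of the vaccination dates described above.

```python
import pandas as pd

# Hypothetical miniature inputs; the column names are assumptions.
vacc = pd.DataFrame({
    "Date": ["12/10/2021", "12/10/2021", "12/11/2021"],  # MM/DD/YYYY, one row per county
    "Recip_State": ["PA", "PA", "PA"],
    "Series_Complete_Yes": [100, 250, 400],
})
policy = pd.DataFrame({
    "Date": ["2021-12-10", "2021-12-11"],
    "State": ["PA", "PA"],
    "StringencyIndex": [40.0, 42.5],
    "ConfirmedCases": [1000, 1100],
})

# Parse both date columns so that the join keys share one dtype.
vacc["Date"] = pd.to_datetime(vacc["Date"], format="%m/%d/%Y")
policy["Date"] = pd.to_datetime(policy["Date"], format="%Y-%m-%d")

# Aggregate county rows into one row per state and day.
vacc_state = (
    vacc.groupby(["Date", "Recip_State"], as_index=False)["Series_Complete_Yes"]
    .sum()
    .rename(columns={"Recip_State": "State"})
)

# A left join keeps every policy row, even on days without vaccination records.
merged = policy.merge(vacc_state, on=["Date", "State"], how="left")
print(merged)
```

The left join is a design choice in this sketch: the policy data starts well before vaccinations exist, so those early rows would otherwise be dropped.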

3.1 Plot the data on a general scale using line charts

After some basic data processing, I can plot the data to get a rough picture of the pandemic in the US. I plot four line charts showing the number of confirmed cases, the number of confirmed deaths, the number of vaccinations, and the stringency of the policies.

3.2 Plot the data on a general scale for the US using animation

After plotting the four values, I can see several important periods in which the policies were changing. The first is between March and April 2020, when the stringency suddenly increased. The second is between July and November 2020: although the number of cases and deaths was steadily increasing, the stringency of the policies was decreasing. As a result, there was a dramatic increase in cases around December 2020. The third is between February and July 2021. As the number of vaccinations increased, cases and deaths grew more slowly and the stringency of the policies decreased. The fourth is around December 2021, when both the cases and the stringency of the policies show an increasing trend.

Next, I want to take a closer look at the data to find the more exact dates of these changes and the related events in the timeline for those periods. I plot an animation of the four lines to show how they change by date, using Pandas Alive.

Through the animation, I can see that the first important date is around March 10, 2020. Looking it up in the timeline data, we can see that on March 11 the WHO declared COVID-19 a pandemic. On March 13, "President Donald J. Trump declares a nationwide emergency". On March 15, US states started shutting down. As a result, the stringency increased rapidly. During the second period, as the presidential election campaign moved forward, Trump tested positive on Oct 2, 2020. Soon afterwards, Trump announced that he had recovered, and the stringency decreased a little. The election took place on Nov 3. We can observe that the slope of the confirmed-cases curve changed rapidly around that time, which means the rate of new cases increased rapidly. Right after the election, the stringency increased again. In the third period, vaccines for COVID-19 became available. As the number of vaccinations increased, the rates of cases and deaths decreased, and then the stringency of the policies decreased. In the fourth period, the Omicron variant was detected, and both the cases and the stringency of the policies show an increasing trend.

At this point, I would like to make the assumption that the stringency of policies is related to the rate of increase of cases and to vaccination coverage: the government tends to control the rate of cases. This is a general assumption. Next, I visualize the data for most states to further explore the relationship between the policies and COVID-19.

3.3 Plot the data across the US states using bar charts and geo maps in animation

Next, I try plotting the data for every US state as animations. To do this I need several DataFrames, each with states as columns and dates as the index. Each DataFrame represents one quantity (ConfirmedCases, ConfirmedDeaths, etc.).
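One way to build such DataFrames is to pivot a long table with one row per date and state. The sketch below assumes hypothetical column names (`Date`, `State`, `ConfirmedCases`, `ConfirmedDeaths`) and made-up numbers; the helper `to_wide` is illustrative and not part of any library.

```python
import pandas as pd

# Hypothetical long-format table: one row per (date, state).
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-10", "2021-12-10", "2021-12-11", "2021-12-11"]),
    "State": ["PA", "NY", "PA", "NY"],
    "ConfirmedCases": [1000, 2000, 1100, 2050],
    "ConfirmedDeaths": [10, 30, 11, 31],
})


def to_wide(df: pd.DataFrame, value: str) -> pd.DataFrame:
    """Return a frame with dates as index, states as columns, and `value` as cells."""
    return df.pivot(index="Date", columns="State", values=value).sort_index()


# One wide DataFrame per quantity, keyed by the quantity name.
wide = {name: to_wide(long_df, name) for name in ["ConfirmedCases", "ConfirmedDeaths"]}
print(wide["ConfirmedCases"])
```

Keeping the frames in a dictionary lets the same plotting loop run over every quantity.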

After the above steps, I can plot multiple bar charts. Then I want to plot them together with geometry information, so I downloaded the shapefile of the US from the US Census and used geopandas to read it. I then used EPSG:3081 to convert the coordinate system. I removed the states GU, MP, NV, AS, and HI, since they are too small to be shown.

Before plotting, I need to convert the DataFrames into GeoDataFrames with embedded geometry information. The first step is to transpose the DataFrames so that states form the index and dates form the columns. Then I convert the geometry information into a dictionary and use it to assign a geometry to each state. Finally, I convert the DataFrames into GeoDataFrames and plot the data in a $2\times 2$ grid.
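The transposition and dictionary lookup described above can be sketched with pandas alone. In this sketch the geometry values are placeholder WKT strings, which is an assumption made so that the example runs without a shapefile; in practice they would be the geometry objects read from the Census shapefile. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical wide frame: dates as index, states as columns.
cases = pd.DataFrame(
    {"NY": [2000, 2050], "PA": [1000, 1100]},
    index=pd.to_datetime(["2021-12-10", "2021-12-11"]),
)

# Placeholder geometry lookup keyed by state code (WKT strings stand in
# for the real geometry objects from the shapefile).
geometry_by_state = {
    "NY": "POLYGON ((0 0, 1 0, 1 1, 0 0))",
    "PA": "POLYGON ((1 0, 2 0, 2 1, 1 0))",
}

# Transpose so that states become the index and dates become the columns.
by_state = cases.T
by_state.columns = [d.strftime("%Y-%m-%d") for d in by_state.columns]

# Attach a geometry to each state through the dictionary.
by_state["geometry"] = by_state.index.map(geometry_by_state)
print(by_state)
```

With real geometry objects in the `geometry` column, a frame of this shape could then be wrapped as `geopandas.GeoDataFrame(by_state, geometry="geometry")`.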

The title of each subplot can be set; the following table shows the titles of the subplots by position.

| | Left | Right |
|---|---|---|
| Top | Confirmed Cases | Confirmed Deaths |
| Bottom | Completed Vaccinations | Stringency of States |

In the plot of Stringency of States, the color represents the value: the deeper the color, the higher the stringency.

From this animation, I can further support my assumption that the stringency of policies is related to the rate of increase of cases and the number of vaccinations. I believe this also holds for most states. To observe this more closely, I plot the four values separately using geo maps.

This is how the cases progress in the US.

This is how the deaths progress in the US.

This is how the stringency changes in the US.

This is the animation of the number of completed vaccinations in the US.

4 Numerical Analysis

4.1 Basic observation of numerical analysis

To test my assumption that the stringency of policies is related to the rate of increase of cases and the number of vaccinations, I use the following steps:

  1. Define a partition $P$, which divides the data into $n$ parts with boundaries $date_0 < date_1 < \dots < date_n$.
  2. $dates = date_n-date_0$
  3. The rate of cases or rate of deaths in part $i$ of $P$ is defined as the average daily change of the data over that part. For instance, $R_{Ci} = \frac{C_i-C_{i-1}}{date_i-date_{i-1}}, 1\le i\le n$
  4. The stringency in part $[i-1,i]$ of $P$ is defined as the average over that part: $S_i = \frac{\sum\limits_{j=date_{i-1}}^{date_i}S[j]}{date_i-date_{i-1}}$
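The steps above can be sketched with NumPy as follows. The function name `partition_stats`, the equally spaced boundaries, and the sample series are assumptions for illustration. The sketch averages the stringency over the half-open range from $date_{i-1}$ up to but not including $date_i$, so that each day is counted once; this differs slightly from summing over both endpoints.

```python
import numpy as np


def partition_stats(cumulative: np.ndarray, stringency: np.ndarray, n: int):
    """Split daily series into n parts; return per-part rate and mean stringency."""
    days = len(cumulative)
    if not 1 <= n < days:
        raise ValueError("n must satisfy 1 <= n < number of days")
    # n + 1 boundary indices date_0 < date_1 < ... < date_n (as day offsets).
    bounds = np.linspace(0, days - 1, n + 1).round().astype(int)
    # R_i = (C_i - C_{i-1}) / (date_i - date_{i-1})
    rates = np.diff(cumulative[bounds]) / np.diff(bounds)
    # S_i = mean of S[j] for date_{i-1} <= j < date_i (half-open, see lead-in)
    mean_s = np.array(
        [stringency[lo:hi].mean() for lo, hi in zip(bounds[:-1], bounds[1:])]
    )
    return rates, mean_s


# Hypothetical 13-day series of cumulative cases and daily stringency.
cumulative = np.array([0, 0, 0, 0, 0, 20, 40, 60, 80, 90, 100, 110, 120], dtype=float)
stringency = np.array([10] * 4 + [50] * 4 + [30] * 5, dtype=float)

rates, mean_s = partition_stats(cumulative, stringency, n=3)
print(rates, mean_s)
```

Here the boundaries fall on days 0, 4, 8, and 12, so each part covers four days.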

Then I calculate the correlation for different values of $n$.
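A sketch of this calculation is given below, using the Pearson correlation from `np.corrcoef`. The daily series are synthetic and built so that the stringency follows the daily case count by construction, so the strongly positive coefficients only check the code and say nothing about the real data. The helper names and the choice of $n$ values are hypothetical.

```python
import numpy as np


def part_bounds(days: int, n: int) -> np.ndarray:
    """Return n + 1 roughly equally spaced boundary indices over the series."""
    return np.linspace(0, days - 1, n + 1).round().astype(int)


def corr_for_n(cumulative: np.ndarray, stringency: np.ndarray, n: int) -> float:
    """Pearson correlation between per-part case rate and per-part mean stringency."""
    b = part_bounds(len(cumulative), n)
    rates = np.diff(cumulative[b]) / np.diff(b)
    mean_s = np.array([stringency[lo:hi].mean() for lo, hi in zip(b[:-1], b[1:])])
    return float(np.corrcoef(rates, mean_s)[0, 1])


# Synthetic daily data (an assumption): stringency is a linear function
# of the daily case count, so the correlation should be close to 1.
t = np.arange(120)
daily_cases = 100 + 80 * np.sin(t / 15.0)
cumulative = np.cumsum(daily_cases)
stringency = 20 + 0.3 * daily_cases

results = {n: corr_for_n(cumulative, stringency, n) for n in (4, 8, 16)}
for n, r in results.items():
    print(n, round(r, 3))
```

Comparing the coefficient across several values of $n$ shows how sensitive the result is to the choice of partition.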

The correlation coefficients show that the rate of cases and the rate of deaths are positively correlated with stringency, while the number of vaccinations is negatively correlated with stringency. So as the rates of cases and deaths increase, the stringency of policies also increases, and when the number of vaccinations increases, the stringency of policies decreases. Another interesting observation is that as the number of vaccinations increases, the rates of cases and deaths decrease.

4.2 Further observation of numerical analysis

For further analysis, I want to build two kinds of regression models to fit the data: one linear and one nonlinear.

  1. Define a function of stringency in terms of the rate of cases, the rate of deaths, and the number of vaccinations. $S = c\cdot X, X=\begin{pmatrix}R_c \\ R_d \\ V\end{pmatrix}$
  2. Define a function of the rate of cases in terms of stringency and the number of vaccinations. $R_c = c\cdot X, X=\begin{pmatrix}S \\ V\end{pmatrix}$
  3. For the nonlinear regression, use support vector regression (SVR). The linear models can be written in matrix form as $Ac=y$, where $A$ is the matrix whose rows are the observations of $X$.
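For the linear models, one possible way to estimate $c$ in $Ac=y$ is ordinary least squares with `np.linalg.lstsq`. The sketch below uses made-up observations and generates $y$ from known coefficients so that the fit can be checked; neither the numbers nor the choice of solver come from the original analysis.

```python
import numpy as np

# Hypothetical per-partition observations; each row of A is (R_c, R_d, V).
A = np.array([
    [100.0, 2.0, 0.0],
    [400.0, 9.0, 0.0],
    [900.0, 20.0, 1.0],
    [600.0, 14.0, 5.0],
    [300.0, 6.0, 9.0],
])

# Stringency generated from known coefficients (an assumption, for checking only).
true_c = np.array([0.05, 1.0, -2.0])
y = A @ true_c

# Solve min_c ||A c - y||_2 by least squares.
c, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ c
print(c)
```

For the nonlinear model, a library implementation such as scikit-learn's `SVR` could be fitted on the same `A` and `y`; whether the original analysis used that library is an assumption.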

Then I can fit different models using different partitions $P$, that is, different values of $n$.

Then I plot the results of the regression models and calculate the error, or loss, using $$Loss=\frac{1}{2}||\hat{y}-y||$$
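As a small worked example of this loss, the sketch below assumes that $||\cdot||$ denotes the Euclidean norm; the function name `loss` and the sample vectors are hypothetical.

```python
import numpy as np


def loss(y_hat, y) -> float:
    """Loss = 0.5 * ||y_hat - y||, using the Euclidean (L2) norm (an assumption)."""
    diff = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return 0.5 * float(np.linalg.norm(diff))


# The difference vector is (3, -4), whose Euclidean norm is 5, so the loss is 2.5.
value = loss([3.0, 0.0], [0.0, 4.0])
print(value)
```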

5 Conclusion and Future Work

The regression models support my assumption that the stringency of policies is related to the rate of increase of cases and the number of vaccinations. However, the number of completed vaccinations and the stringency of policies have little effect on the rate of cases and the rate of deaths. This is the oddest observation. To further investigate this observation as well as my assumption, I think a hypothesis test against a null hypothesis should be used. I could also apply the same numerical analysis to the individual states in the US. In addition, the stringency of policies could also be related to the data of other countries.

So far I can conclude that the stringency in the US as a whole, without digging into individual states, is driven by the rate of cases, the rate of deaths, and the number of completed vaccinations.