NYC COVID-19 Ethnicities Analysis

Unless you live in a cave in the Scottish Highlands or a turf hut in Iceland, the coronavirus pandemic has affected your daily life. It has been the main source of headlines throughout the global media. In late March and early April of 2020, the eyes of the global media were fixed on the United States as the coronavirus pandemic was spreading faster than anyone had anticipated. The national and local governments have been heavily criticized for not acting quickly enough against the spread of COVID-19 (Obama says White House response to coronavirus has been 'absolute chaotic disaster'; US's global reputation hits rock-bottom over Trump's coronavirus response).

The epicenter of the pandemic in the USA is New York City. By the 15th of March 2020, 10 people had already died in NYC; fast forward to the 15th of May and the death toll had soared to around 87,000.

In the midst of the early crisis, news headlines and reports stated that people of *African-American* and *Hispanic* ethnicities were being hit harder by the crisis than any other ethnic group.

Other sources instead sought to link the spread to each individual's socioeconomic status.

The widespread sensationalist headlines and lack of clarity might distort the public's understanding of why some ethnicities are hit harder than others. The goal of this interactive data visualization is to shed light, in an easily digestible way, on different factors that might be linked to the spread of COVID-19.

It is shown that ethnicity is strongly linked with socio-economic status, so the two effects cannot be fully separated. However, a better understanding of each individual factor's correlation with the spread of the virus will make matters less sensationalist and more grounded in sound reasoning.

Timeline Overview

The first case of COVID-19 in New York City (NYC) was confirmed on the 1st of March, and testing began. The USA and NYC had officially entered one of the worst crises since the Great Depression, which began in 1929.
A week later Andrew Cuomo, the governor of New York, declared a state of emergency. Shortly afterwards, a number of commuter guidelines, containment zones and lockdowns were issued.
The interactive data visualization journey starts by showing the governmental timeline of NYC alongside the number of new COVID-19 cases each day.

In the first visualization a number of vertical lines can be observed; each indicates an important event in NYC's fight against COVID-19. Hovering over a vertical line shows a description of that event, which gives a basic overview of how things unfolded.

The plot lines show the trend of COVID-19. Selecting boroughs in the legend shows the number of new positive cases for each of them, as well as the total number of new tests performed in NYC. The tools in the right-hand corner can be used to zoom, or to enable hovering over the plot lines to show the value at a given point. Feel free to play around.

The Observations

The observant reader can see that the number of positive cases differs between boroughs. The Bronx seems to have the highest number of positive cases, while Manhattan is far lower than the other boroughs.
One important thing to point out is the clearly visible spike on April 14th, likely linked to a change in how the number of deaths is reported (more on this event here: N.Y.C. Death Toll Soars Past 10,000 in Revised Virus Count).
The boroughs are also correlated with one another, as their trends and peaks follow each other. The aim of the analysis is to identify potential factors that could explain the differences.

Demographics

Many have attempted to pinpoint a scapegoat for the infection rate of the virus. In the beginning it was said that only the elderly would get seriously ill. Later on it became apparent that people of all ages were in danger of losing their health and even their lives to the pandemic.

So, what is it?

Is it age, is it ethnicity, or are the poor the ones really in danger?

The plot below displays the proportion of each ethnic group per borough. The Bronx has a substantially large share of people belonging to the Hispanic, Latino and African American communities. Staten Island's population is, instead, more than 50% white.

To explore the 4 plots right above, we first need to understand the concept of the boxplot: “A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.” More information for the curious user.
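
The five-number summary quoted above can be sketched in a few lines of Python. This is only an illustration, not the code behind the plots: the function name `boxplot_summary` and the sample values are hypothetical, and it assumes the common convention in which the whiskers ("minimum" and "maximum") stop at the last data point within 1.5 times the interquartile range of the box, with anything beyond reported as an outlier.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the numbers a typical boxplot draws (hypothetical helper).

    Assumes the 1.5 * IQR whisker convention; plotting libraries may
    compute quartiles with slightly different interpolation rules.
    """
    data = sorted(values)
    # Three cut points that split the data into four equal parts.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],    # the boxplot "minimum"
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],  # the boxplot "maximum"
        "outliers": outliers,
    }


# Made-up values: one point lies far above the rest.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

With these made-up values the box spans 3 to 7 around a median of 5, the whiskers reach 1 and 8, and 30 is flagged as an outlier, which is why the "minimum" and "maximum" in the quote are written in quotation marks.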

We can now quickly appreciate the distribution of income, medical spending and median age of each ethnic group in the city, as well as the poverty rate per ethnic group in each borough.
It appears that the white population is the richest and the oldest.
The Hispanic population has the lowest income, spends the least on medical care and is also the youngest. If you live in the Bronx or Brooklyn, you are more likely to be poor than if you live in the other boroughs of NYC.

What does this indicate for the spread of the Coronavirus in New York City? Does it give us a clear indication of the underlying factors that lead to some socioeconomic groups being hit hardest by the Coronavirus? It is safe to say that there is room for further exploration of demographics and their link to COVID-19.

Map Overview

What follows are two elements that are meant to be used in unison, although each can also be used separately.
The map displays the geographical distribution of the statistical indicator selected from the menu at its bottom right. The subdivision shown is the ZIP Code Tabulation Area (ZCTA), a standard geographic unit defined by the US Census Bureau (more information for the curious).
Below the map is a scatter plot. It shows the total number of COVID-19 cases in the various areas against the statistic chosen from the menu at the top right of this plot. It is the same selection menu, which controls both plots. The size of each ball depends linearly on the population of that area: a bigger ball means a bigger population.
The two plots interact: clicking on an area in the map highlights the respective ball in the correlation plot. If a certain color stands out or you are interested in the numbers for a specific area, just click on the area, or select more than one by clicking on more. You can always reset the selection by clicking outside any area or by selecting another statistic.
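
To make the size rule of the scatter plot concrete, here is a minimal sketch of a linear mapping from population to ball size. The function name `marker_size`, the pixel range of 5 to 40 and the three populations are all assumptions made for illustration; the actual plot may use a different range or a simple proportional scaling.

```python
def marker_size(population, pop_min, pop_max, size_min=5.0, size_max=40.0):
    """Map a population linearly onto a marker size (hypothetical helper).

    The smallest area gets size_min, the largest gets size_max, and
    everything in between is interpolated on a straight line.
    """
    if pop_max == pop_min:
        # All areas have the same population: use a mid-range size.
        return (size_min + size_max) / 2
    fraction = (population - pop_min) / (pop_max - pop_min)
    return size_min + fraction * (size_max - size_min)


# Made-up populations for three areas.
populations = {"A": 10_000, "B": 55_000, "C": 100_000}
smallest = min(populations.values())
largest = max(populations.values())
sizes = {
    area: marker_size(pop, smallest, largest)
    for area, pop in populations.items()
}
print(sizes)
```

Because the mapping is linear, an area halfway between the smallest and largest population (area B here) gets a ball exactly halfway between the smallest and largest size.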

A brief description of the available statistics follows:

Commuters using public transportation. This is the number of workers who use public transportation to get to work. This gives us insight into the utilization of public transportation. It is a factor that is tied to the location, as this will be higher for places where use of public transportation is a must, such as the heart of Manhattan.
Per-capita income. This is the income per capita in each ZCTA. It is important for understanding how rich an area is on average.
Education Index. Measures the average education level of an area; a higher value means that, on average, more people have a higher education.
Median Age (years). The median age, in years, of the ZCTA.
Ratio of female inhabitants. The proportion of female inhabitants living in that area. We only show female ratio because the male ratio is complementary.
Total population. Total number of people that live in the ZCTA.
White people (every thousand), African American people (every thousand), Hispanic or Latino (every thousand), Asian people (every thousand). Number of people of that ethnicity per thousand inhabitants of that area.
Number of positives. Number of positive COVID-19 cases.
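
The "every thousand" statistics in the list above are rates rather than raw counts. A minimal sketch of the arithmetic, using a hypothetical helper `per_thousand` and made-up numbers:

```python
def per_thousand(group_count, total_population):
    """Number of people in a group per 1,000 inhabitants (hypothetical helper)."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return 1000 * group_count / total_population


# Made-up area: 12,500 people of one ethnicity out of 50,000 inhabitants.
rate = per_thousand(12_500, 50_000)
print(rate)

# The female and male ratios are complementary, so showing one is enough.
female_ratio = 0.52  # made-up value
male_ratio = 1 - female_ratio
```

Expressing the ethnicity statistics per thousand inhabitants makes areas of very different sizes comparable on the same axis.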

We can now freely explore the correlations between these factors and the number of cases. The first thing to note comes from looking at the total population: there is no strong correlation between the population of an area and its number of cases. This suggests that if overcrowding is an issue, it does not come simply from having a high number of people living in the same area.

We then want to focus on the education index, where we can more clearly see a correlation. Areas with a higher education index tend to have fewer Coronavirus cases. Here, it is important to refrain from speculation: correlation does not imply causation. We can only point out that this is what we see in the data we gathered and make some informed guesses as to why this might have happened. People with a higher education might be more aware of current events, or it may simply be that they have a higher income, giving them a better chance of being able to stay home when the pandemic hits. We explore the correlation between these two factors a bit further down.
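
For readers who want to put a number on what the scatter plot shows, the strength of a linear relationship is commonly measured with the Pearson correlation coefficient. The sketch below is an assumption about how one might compute it, not the code used for this page; the function name `pearson_r` and the five data points are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Made-up values: a higher education index goes with fewer cases.
education_index = [1, 2, 3, 4, 5]
positive_cases = [50, 40, 35, 20, 10]
r = pearson_r(education_index, positive_cases)
print(round(r, 2))
```

A value close to -1 or +1 indicates a strong linear relationship and a value near 0 a weak one. In either case it only describes how the two quantities move together, not why.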

The exploration now starts. Looking at income, we see a correlation similar to the one we saw with the education index. If we look instead at specific ethnicities, we no longer see such strong correlations. The message is clear: stopping at the ethnicity level does not tell us enough about what is really happening. Factors such as income and education seem to play a much more important role. Can we say more about this? We try to do so by using a machine learning model, trained on these exact statistics.

The Model

To round off the interactive data visualization journey, a simple model has been implemented. Below we see correlation plots of different socio-economic variables and attributes. The variables chosen are "Female", "Per-capita Income", "Education Index", "Average household size" and, finally, "Median Age". It is of interest to see how varying these variables changes the prediction of positive cases.
First, the correlation between the variables is investigated.

Above we see that there is a strong correlation between "Per-capita Income", "Average Household Size" and "Education Index". There is also a correlation between the number of positive cases and the aforementioned variables. These variables are used to set up the model, which is a plain linear regression. The coefficients of the model allow us to vary the variables and see how they affect the response variable, which is simply the number of positive cases. The aim is not an accurate prediction, but rather being able to do inference.
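
The idea of reading effects off regression coefficients can be sketched as follows. This is a hedged illustration and not the model from the notebook: it fits ordinary least squares through the normal equations in pure Python, uses only two predictors, and runs on synthetic data generated from a known formula so the recovered coefficients can be checked. All function names and values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] by ordinary least squares."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Predicted response for one row of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic data: (education index, average household size) per area.
rows = [(1, 2), (2, 2), (3, 3), (4, 3), (5, 4), (2, 4)]
# Generated from a known rule, so the fit should recover 100, -10 and 20.
targets = [100 - 10 * e + 20 * h for e, h in rows]

coefs = fit_linear_regression(rows, targets)
print([round(c, 3) for c in coefs])
print(round(predict(coefs, (3, 2)), 3))
```

The sign of each coefficient is what matters for inference: in this synthetic example a negative coefficient on the education index lowers the predicted count, and a positive one on household size raises it. The solver also raises an error when predictors are perfectly collinear, which hints at why two very strongly correlated variables are hard to tell apart in a single model.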

Moving the sliders and pressing the "Predict Positives" button produces a prediction: the predicted number of positive cases based on the input from the sliders. It can now be seen that the prediction decreases when moving "Education index" or "Percentage of females" up, while it increases when sliding "Median Age" or "Average household size" up. This corresponds with the earlier findings: higher education goes with fewer cases, while a bigger average household size goes with more cases. The reason "Income" is not shown in the sliders is that "Education index" is so highly correlated with it that it captures the information in "Income"; likewise, an increase in "Income" would decrease the prediction output. So, according to the model, the profile with the highest chance of contracting COVID-19 is male, with low income, low education, a large household and, preferably, old.

The Links

We showed that ethnicity is clearly not the whole picture: there is much more going on, between ethnicities more than within them. A full understanding cannot come from this analysis alone. The point is to make anyone reading this aware that the factors that played into the spread of Coronavirus were much more complicated and inter-related than a brief statistical overview of which ethnicities were hit hardest would suggest. Careful consideration of the data showed this.

The Notebook (Github Version)

The cleaned datasets