Exploring the Pandemic

Processing and Visualizing Covid-19 Data

Michael Clark https://m-clark.github.io
2020-06-10


Introduction

This is some code to get processed data and visualizations regarding the COVID-19 outbreak. The goal is simply to present some code that I’m playing with, and that may make it easier for others to get their hands dirty. In other words, this is meant to enable others to play with the data as well, while presenting some clean code and data. There are already some R packages doing some of this, but they are currently (in my opinion) problematic, slow to update, and/or don’t really offer much more than what you could do yourself, as demonstrated below. I’ll likely be updating this daily for the time being.

I actually started by scraping Wikipedia’s WHO incidence report tables, but even a cursory glance revealed numerous issues, lots of needed cleaning, and a format that changed almost daily. I then started playing with the data behind the Johns Hopkins dashboard, which was notably better, but it lacked some info and had issues of its own. I then settled on Open Covid, but now just use whatever source works easiest for a given situation.

I use some custom functions in the following, and I don’t show every bit of code, but most of the omissions are inconsequential. The bulk of this code should be usable pretty easily by anyone familiar with the tidyverse.

Resources:

For the country level plots:

For U.S. county level plots:

Check out CSCAR chum Alex Cao’s more local efforts here:

Getting Started

Functions

The following functions process the data from different sources. I only show parts of them, as they are a bit long, but they can be found here. They all require the tidyverse, or at least dplyr and readr.

Open-Covid Data

The Open Covid data was one of the more usable data sets early on. A bonus is that it already includes population, and since it comes from a single source, the function doesn’t take much code. The main drawback I’ve had with it is occasional unnecessary column changes, which effectively break the initial read as the function tries to verify column types. Other issues include misspelled names and other minor things. Full code.


read_open_covid_data <- function(
  country = NULL,     # Filter to specific country
  current = FALSE,    # Only the current data
  totals  = FALSE     # Only totals (no regional)
) {
  
  if (current) {
    data = read_csv('https://open-covid-19.github.io/data/data_latest.csv',
                    col_types = 'Dcccccddddd')
  }
  else {
    data = read_csv('https://open-covid-19.github.io/data/data.csv',
                    col_types = 'Dcccccddddd')
  }
  
  if (!is.null(country)) {
     data = filter(data, CountryCode == country | CountryName == country)
  }
  
  .
  .
  .
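
As a side note, one way to guard against the column-type changes mentioned above (a sketch, not part of the actual function) is to read everything as character and convert types afterward:

```r
library(readr)
library(dplyr)

# Read all columns as character so upstream type changes can't break
# the read, then let type_convert() guess the types afterward.
read_covid_raw = function(file) {
  read_csv(file, col_types = cols(.default = col_character())) %>% 
    type_convert()
}
```

This trades a little speed for robustness; readr will still warn if anything fails to parse.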

Johns Hopkins Data

Here is the start of the function for the Johns Hopkins data. This data presumably underlies the popular dashboard, but there are issues. For some bizarre reason, they appear to randomly omit country totals for Australia, Canada, and China, providing only province-level data, so for country totals this must be dealt with. In addition, confirmed cases, deaths, and recovered are needlessly separated into different files. Full code.


read_jh_covid_data <- function(
  first_date = lubridate::mdy('01-22-2020'),
  last_date  = Sys.Date() - 1,
  country_state = NULL,     # country if global data, state if US
  include_regions = FALSE,  # for world data, include province/state specific data?
  us         = FALSE,
  wider      = TRUE
) {
  
  if (!us) {
    cleanup_global = function(data) {
      data %>% 
        pivot_longer(
          -c(`Province/State`, `Country/Region`, Lat, Long),
          names_to  = 'date',
          values_to = 'count'
        ) %>% 
        mutate(date = lubridate::mdy(date)) %>% 
        rename(
          province_state = `Province/State`,
          country_region = `Country/Region`
        ) %>% 
        rename_all(tolower)
    }
    
    init_confirmed = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
    init_deaths    = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
    init_recovered = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
    
    .
    .
    .
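
The province-only issue for Australia, Canada, and China means country totals must be summed from the province rows. A minimal sketch of that step, using the cleaned (lowercase) column names from above on toy data:

```r
library(dplyr)

# Toy data mimicking the cleaned JH format, where Canada appears
# only at the province level.
toy = tibble(
  country_region = c('Canada', 'Canada', 'Italy'),
  province_state = c('Ontario', 'Quebec', NA),
  date  = as.Date('2020-03-01'),
  count = c(10, 5, 100)
)

# Collapse province rows to get one total per country per day.
country_totals = toy %>% 
  group_by(country_region, date) %>% 
  summarise(count = sum(count, na.rm = TRUE)) %>% 
  ungroup()
```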

New York Times Data

The NYT data is clean and ‘tidy’, and so a great resource for starting out. There are still some issues, however (e.g. Georgia suddenly having fewer cumulative cases from one day to the next). Full code.


read_nyt_data <- function(states = TRUE) {
  if (states) {
    us_states0 <- readr::read_csv(
      "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
      col_types = 'Dccdd'
    )
    
    data = us_states0 %>%
      filter(!state %in% c(
        'Puerto Rico',
        'Guam',
        'Northern Mariana Islands',
        'Virgin Islands',
        'American Samoa')
      ) %>% 
      arrange(state, date) %>%
      group_by(state) %>%
      mutate(
        # lag default of 0 makes the first day's daily count equal the total
        daily_cases  = cases  - lag(cases,  default = 0),
        daily_deaths = deaths - lag(deaths, default = 0)
      ) %>% 
      ungroup() %>% 
      left_join(tibble(state = state.name, state_abb = state.abb)) %>% 
      mutate(state_abb = as.factor(state_abb))
  }
  .
  .
  .
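
As an aside, decreasing cumulative counts like the Georgia example can be flagged with a quick check before trusting the daily differences (a sketch on toy data with the NYT column names):

```r
library(dplyr)

# A cumulative count should never decrease; rows where it does are suspect.
toy = tibble(
  state = 'Georgia',
  date  = as.Date('2020-05-01') + 0:2,
  cases = c(100, 120, 115)   # drops on the third day
)

suspect = toy %>% 
  group_by(state) %>% 
  filter(cases < lag(cases, default = 0)) %>% 
  ungroup()
```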

Read the data


library(tidyverse)  # required for function

countries = read_jh_covid_data()    # country-level for all dates

world_totals = countries %>% 
  group_by(country_region) %>% 
  filter(date == max(date))

us = read_nyt_data()  # all US data

us_current = us %>% 
  filter(date == max(date))  # only current US state totals

county_trends = read_nyt_data(states = FALSE)

Initial Look

World

Now that we have the data, we can see what’s going on. These are posted as of 2020-06-10 15:43:43 UTC.

Total Confirmed | Total Deaths | Death Rate
7,040,570 | 398,560 | 5.7%

United States

Total Confirmed | Total Deaths | Death Rate
1,973,230 | 111,989 | 5.7%

Michigan

Total Confirmed | Total Deaths | Death Rate
65,190 | 5,946 | 9.1%

Washtenaw County

The following uses rvest to grab the current tally for this county from the county website.


URL = 'https://www.washtenaw.org/3108/Cases'

library(rvest)

init = read_html(URL) %>% 
  html_table()

washtenaw_cases = as_tibble(init[[1]], .name_repair = 'unique')

Total Confirmed | Total Hospitalized | Total Recovered | Total Deaths | Death Rate
1,637 | 377 | 1,425 | 104 | 6.4%
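
The summary values above are simple ratios of the scraped counts; for instance, the death rate is just deaths over confirmed cases. A sketch with a hypothetical helper and the table values hard-coded for illustration (not the code used for the table itself):

```r
# Death rate as deaths / confirmed, formatted as a percentage.
death_rate = function(deaths, confirmed) {
  sprintf('%.1f%%', 100 * deaths / confirmed)
}

# e.g. death_rate(104, 1637) returns "6.4%"
```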

The following does some additional processing before going to a plot of the trends. I pick some countries to highlight (not necessarily the ones with the most cases), and I treat the world total as a separate addition to the plot.


highlight = c(
  'US',
  'China',
  'Japan',
  'Korea, South',
  'Italy',
  'Iran',
  'United Kingdom',
  'France',
  'Germany',
  'Spain'
)

world = countries %>%
  group_by(date) %>%
  summarise(confirmed = sum(confirmed, na.rm = T))

Now we can just do a basic plot. I use ggrepel to see more clearly where the highest cases are currently.


library(ggrepel)
library(scico)   # for scale_color_scico_d / scale_fill_scico

p = countries %>% 
  ggplot(aes(x = date, y = confirmed)) +
  geom_path(aes(group = country_region), alpha = .005) +
  geom_point(
    size  = 5,
    alpha = .1,
    data  = world
  ) +
  geom_point(
    aes(color = country_region),
    size  = 1,
    alpha = .5,
    data  = filter(countries, country_region %in% highlight)
  ) +
  geom_text_repel(
    aes(label = country_region, color = country_region),
    size  = 2,
    alpha = .85,
    data  = filter(countries, country_region %in% highlight, date == max(date)-1),
    show.legend = FALSE
  ) +
  scale_color_scico_d(begin = .1, end = .9) +
  scale_x_date(
    date_breaks = '2 weeks',
    labels = function(x) format(x, format = "%m/%d")
  ) +
  scale_y_continuous(
    position = "right",
    trans    = 'log',
    breaks   = c(50, unlist(map(c(1,5), function(x) x*10^(2:6)))), 
    labels   = scales::comma
  ) +
  visibly::theme_clean() + 
  labs(
    x = '', 
    y = '',
    subtitle =  'Total Confirmed Cases',
    caption  = 'Dark large dot is world total'
    ) +
  theme(
    axis.text.x  = element_text(size = 6),
    axis.text.y  = element_text(size = 6),
    axis.title.y = element_text(size = 6),
    axis.ticks.y = element_blank(),
    legend.title       = element_blank(),
    legend.key.size    = unit(.25, 'cm'),
    legend.text        = element_text(size = 6),
    legend.box.spacing = unit(0, 'mm'),
    legend.box.margin  = margin(0),
    legend.position    = 'left',
    title = element_text(size = 12)
  )

p

The following animates the previous via gganimate.


# the resulting gif needs to be moved to the directory of the post to see it
library(gganimate)

p_anim = p +
  transition_reveal(date) +
  shadow_wake(wake_length = 1/3, falloff = "cubic-in-out")

p_animate = animate(
  p_anim,
  nframes = 120,
  fps = 10,
  start_pause = 5,
  end_pause   = 15,
  width  = 800,
  height = 600,
  device =  'png',
  res    = 144
)

p_animate

NY Times-style Daily Count

The following is similar to the plot shown on the New York Times daily count, just without the unnecessary discretizing of the color (the legend already does that for you). I chose what I thought was a similar palette, but obviously you can play around with that. Here is the ggplot code, but I show an interactive result via highcharter.


# reduce to just the 20 countries with the most cases
top_20 = world_totals  %>% 
  ungroup() %>% 
  top_n(20, confirmed) %>% 
  arrange(desc(confirmed))
 
# compute daily counts; line_positions creates a ruled effect (not necessary)
plot_data = countries %>%
  filter(country_region %in% top_20$country_region) %>%
  group_by(country_region) %>% 
  mutate(
    daily_confirmed = confirmed - lag(confirmed),
    daily_confirmed = if_else(is.na(daily_confirmed), confirmed, daily_confirmed)
  ) %>% 
  ungroup() %>% 
  mutate(
    country_region   = ordered(country_region, levels = rev(top_20$country_region)),
    line_positions = as.numeric(country_region) + .5,
    line_positions = ifelse(line_positions == max(line_positions), NA, line_positions)
  ) 
  
plot_data %>% 
  ggplot(aes(x = date, y = country_region)) +
  geom_tile(
    aes(fill = daily_confirmed),
    width  = .9,
    height = .5,
    na.rm  = TRUE
  ) +
  geom_hline(
    aes(yintercept = line_positions),
    color = 'gray92',
    size  = .25
  ) +
  scale_fill_scico(
    end = .75,
    na.value = 'gray98',
    palette  = 'lajolla',
    trans    = 'log',
    breaks   = c(5, 25, 100, 500, 2500)
  ) +
  labs(x = '', y = '') +
  guides(fill = guide_legend(title = 'New Cases')) +
  visibly::theme_clean() +
  theme(
    axis.ticks.y = element_blank(),
    legend.text  = element_text(size = 6),
    legend.title = element_text(size = 10)
  )

The following shows the percentage increase in cases over the previous day (values of 100% or more are capped at 100%). Early on this isn’t very useful (e.g. going from 5 to 10 cases is a 100% increase). We can see that China and South Korea currently have minimal to no day-to-day increase, while other countries continue to see relatively large increases.
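
The plot itself isn’t shown in this excerpt, but the underlying calculation can be sketched on toy data, assuming the column names used above:

```r
library(dplyr)

toy = tibble(
  country_region = 'A',
  date = as.Date('2020-03-01') + 0:3,
  confirmed = c(5, 10, 12, 40)
)

pct_increase = toy %>% 
  group_by(country_region) %>% 
  mutate(
    pct_change = 100 * (confirmed - lag(confirmed)) / lag(confirmed),
    pct_change = pmin(pct_change, 100)   # cap values at 100%
  ) %>% 
  ungroup()
```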

The following shows the number of confirmed cases per 10,000 people using the Open Covid data, which includes population. This obviously doesn’t account for lack of testing, country ineptitude or discretion in counting, and many other things. To keep things clean, we focus on the ten countries with the most cases.


countries_oc = read_open_covid_data() %>% 
  filter(is.na(region_code))
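
The per-capita step itself is straightforward; a sketch on toy data, assuming columns named confirmed and population as in the processed Open Covid data:

```r
library(dplyr)

toy = tibble(
  country    = c('A', 'B'),
  confirmed  = c(5000, 200),
  population = c(1e6, 1e5)
)

# Cases per 10,000 people
per_capita = toy %>% 
  mutate(cases_per_10k = confirmed / population * 10000)
```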

State Totals

Here is a data table for easy look-up of current U.S. state totals. It is based on the New York Times database.

We can visualize the state level data in numerous ways. Here is the world trend plot previously seen, now applied just to the U.S. I used plotly to make it interactive.


highlight = us_current %>% 
  top_n(10, cases) %>% 
  pull(state_abb)

us_total = countries %>% 
  filter(country_region == 'US') %>% 
  rename(cases = confirmed)

p = us %>% 
  ggplot(aes(x = date, y = cases)) +
  geom_path(aes(group = state), alpha = .01) +
  geom_point(
    size  = 6,
    alpha = .1,
    data  = us_total
  ) +
  geom_point(
    aes(color = state),
    size  = .5,
    alpha = .5,
    data  = filter(us, state_abb %in% highlight)
  ) +
  geom_text_repel(
    aes(label = state_abb, color = state),
    size  = 2,
    alpha = .85,
    data  = filter(us_current, state_abb %in% highlight)
  ) +
  scale_color_scico_d(begin = .1, end = .9) +
  scale_x_date(date_breaks = '2 weeks') +
  scale_y_continuous(
    position = "right",
    trans    = 'log',
    breaks   = c(50, unlist(map(c(1,5), function(x) x*10^(2:6)))), 
    labels   = scales::comma
    ) +
  visibly::theme_clean() + 
  labs(
    x = '', 
    y = '',
    subtitle =  'Total Confirmed Cases',
    caption = 'Dark large dot is U.S. total'
    ) +
  theme(
    axis.text.x  = element_text(size = 6),
    axis.text.y  = element_text(size = 6),
    axis.title.y = element_text(size = 6),
    axis.ticks.y = element_blank(),
    legend.title       = element_blank(),
    legend.key.size    = unit(.25, 'cm'),
    legend.text        = element_text(size = 6),
    legend.box.spacing = unit(0, 'mm'),
    legend.box.margin  = margin(0),
    legend.position = 'left',
    title = element_text(size = 12)
  )