Exploring the Pandemic

Processing and Visualizing Covid-19 Data

Michael Clark https://m-clark.github.io
2020-06-10

Table of Contents


Introduction

This is some code to get processed data and visualizations regarding the COVID-19 outbreak. The goal here is just to present some code that I’m playing with that may also make it easier for others to get their hands dirty. In other words, this is for enabling others to play with the data as well, while presenting some clean code and data. There are actually already some R packages doing some of this, but they are currently (in my opinion) problematic, slow to update, and or don’t really offer much than what you could just do yourself, as demonstrated below. I’ll likely be updating this daily for the time being.

I actually started scraping Wikipedia’s WHO incidence report tables, but any cursory glance showed numerous issues and lots of cleaning, coupled with a format that changed almost daily. I then started playing with the data behind the Johns Hopkins dashboard, which was notably better, but didn’t have some info and there are issues there as well. I then settled on Open Covid, but now just use whatever source works easiest for a situation.

I use some custom functions in the following, and I don’t show every bit of code, but most of these cases are inconsequential. The bulk of this code should be usable by anyone pretty easily if they are familiar with the tidyverse.

Resources:

For the country level plots:

For U.S. county level plots:

Check out CSCAR chum Alex Cao’s more local efforts here:

Getting Started

Functions

The following are functions that process the data from different sources. I only show parts as they are a bit long1, but they can be found here. They all require tidyverse, or at least dplyr and readr.

Open-Covid Data

The Open Covid was one of the more usable data sets early on. A bonus with this data is that it already includes population, and as it’s one source, the function doesn’t take much code. The one drawback I’ve had with it was some unnecessary column changes, which effectively break the initial read as the function tries to verify column types. Other issues include misspelled names and other minor things. Full code.


read_open_covid_data <- function(
  country = NULL,     # Filter to specific country
  current = FALSE,    # Only the current data
  totals  = FALSE     # Only totals (no regional)
) {
  
  if (current) {
    data = read_csv('https://open-covid-19.github.io/data/data_latest.csv',
                    col_types = 'Dcccccddddd')
  }
  else {
    data = read_csv('https://open-covid-19.github.io/data/data.csv',
                    col_types = 'Dcccccddddd')
  }
  
  if (!is.null(country)) {
     data = filter(data, CountryCode == country | CountryName == country)
  }
  
  .
  .
  .

Johns Hopkins Data

Here is the start of the function for the Johns Hopkins data. This data is presumably used for the website that is so popular, but there are issues. For some bizarre reason, they appear to randomly not provide country totals for Australia, Canada and China, only the province level data, so for country totals this must be dealt with. In addition, confirmed cases, deaths, and recovered are needlessly separated. Full code.


read_jh_covid_data <- function(
  first_date = lubridate::mdy('01-22-2020'),
  last_date  = Sys.Date() - 1,
  country_state = NULL,     # country if global data, state if US
  include_regions = FALSE,  # for world data, include province/state specific data?
  us         = FALSE,
  wider      = TRUE
) {
  
  if (!us) {
    cleanup_global = function(data) {
    data = data %>% 
      pivot_longer(
        -c(`Province/State`, `Country/Region`, Lat, Long),
        names_to = 'date',
        values_to = 'count'
      ) %>% 
      mutate(date = lubridate::mdy(date)) %>%  
      rename(
        province_state = `Province/State`,
        country_region = `Country/Region`,
      ) %>% 
      rename_all(tolower)
  }
    
    init_confirmed = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
    init_deaths    = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
    init_recovered = readr::read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
    
    .
    .
    .

New York Times Data

The NYT data is clean and ‘tidy’, and so a great resource for starting out. There are still some issues however (e.g. Georgia suddenly having fewer cumulative cases from one day to the next). Full code.


read_nyt_data <- function(states = TRUE) {
  if (states) {
    us_states0 <- readr::read_csv(
      "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
      col_types = 'Dccdd'
    )
    
    data = us_states0 %>%
      filter(!state %in% c(
        'Puerto Rico',
        'Guam',
        'Northern Mariana Islands',
        'Virgin Islands',
        'American Samoa')
      ) %>% 
      arrange(state, date) %>%
      group_by(state) %>%
      mutate(
        daily_cases = cases - lag(cases, default = NA),
        daily_cases = if_else(is.na(daily_cases), cases, daily_cases),
        daily_deaths = deaths - lag(deaths, default = NA),
        daily_deaths = if_else(is.na(daily_deaths), deaths, daily_deaths)
      ) %>% 
      ungroup() %>% 
      left_join(tibble(state = state.name, state_abb = state.abb)) %>% 
      mutate(state_abb = as.factor(state_abb))
  }
  .
  .
  .

Read the data


library(tidyverse)  # required for function

countries =  read_jh_covid_data()    # country-level for all dates

world_totals = countries %>% 
  group_by(country_region) %>% 
  filter(date == max(date))

us = read_nyt_data()  # all US data

us_current = us %>% 
  filter(date == max(date))  # only current US state totals

county_trends = read_nyt_data(states = FALSE)

Initial Look

World

Now that we have the data, we can see what’s going on. These are posted as of 2020-06-10 15:43:43 UTC.

Total Confirmed Total Deaths Death Rate
7,040,570 398,560 5.7%

United States

Total Confirmed Total Deaths Death Rate
1,973,230 111,989 5.7%

Michigan

Total Confirmed Total Deaths Death Rate
65,190 5,946 9.1%

Washtenaw County

The following uses rvest to grab the current tally for this county from the county website.


URL = 'https://www.washtenaw.org/3108/Cases'

library(rvest)

init = read_html(URL) %>% 
  html_table()

washtenaw_cases = as_tibble(init[[1]], .name_repair = 'unique')
Total Confirmed Total Hospitalized Total Recovered Total Deaths Death Rate
1,637 377 1,425 104 6.4%

The following does some additional processing before going to a plot of the trends. I pick some countries to highlight (not necessarily the ones with the most cases), and I treat the world total as a separate addition to the plot.


highlight = c(
  'US',
  'China',
  'Japan',
  'Korea, South',
  'Italy',
  'Iran',
  'United Kingdom',
  'France',
  'Germany',
  'Spain'
)

world = countries %>%
  group_by(date) %>%
  summarise(confirmed = sum(confirmed, na.rm = T))

Now we can just do a basic plot. I use ggrepel to see more clearly where the highest cases are currently.


library(ggrepel)
p = countries %>% 
  ggplot(aes(x = date, y = confirmed)) +
  geom_path(aes(group = country_region), alpha = .005) +
  geom_point(
    aes(),
    size  = 5,
    alpha = .1,
    data  = world
  ) +
  geom_point(
    aes(color = country_region),
    size  = 1,
    alpha = .5,
    data  = filter(countries, country_region %in% highlight)
  ) +
  geom_text_repel(
    aes(label = country_region, color = country_region),
    size  = 2,
    alpha = .85,
    data  = filter(countries, country_region %in% highlight, date == max(date)-1),
    show.legend = FALSE
  ) +
  scale_color_scico_d(begin = .1, end = .9) +
  scale_x_date(
    date_breaks = '2 weeks',
    labels = function(x) format(x, format = "%m/%d")
  ) +
  scale_y_continuous(
    position = "right",
    trans    = 'log',
    breaks   = c(50, unlist(map(c(1,5), function(x) x*10^(2:6)))), 
    labels   = scales::comma
  ) +
  visibly::theme_clean() + 
  labs(
    x = '', 
    y = '',
    subtitle =  'Total Confirmed Cases',
    caption  = 'Dark large dot is world total'
    ) +
  theme(
    axis.text.x  = element_text(size = 6),
    axis.text.y  = element_text(size = 6),
    axis.title.y = element_text(size = 6),
    axis.ticks.y = element_blank(),
    legend.title       = element_blank(),
    legend.key.size    = unit(.25, 'cm'),
    legend.text        = element_text(size = 6),
    legend.box.spacing = unit(0, 'mm'),
    legend.box.margin  = margin(0),
    legend.position    = 'left',
    title = element_text(size = 12)
  )

p

The following animates the previous via gganimate.


# be moved to the directory of the post to see
library(gganimate)

p_anim = p +
  transition_reveal(date) +
  shadow_wake(wake_length = 1/3, falloff = "cubic-in-out")

p_animate = animate(
  p_anim,
  nframes = 120,
  fps = 10,
  start_pause = 5,
  end_pause   = 15,
  width  = 800,
  height = 600,
  device =  'png',
  res    = 144
)

p_animate

NY Times-style Daily Count

The following is similar to the plot shown on the New York Times daily count, just without the unnecessary discretizing of the color (the legend already does that for you). I chose what I thought was a similar palette, but obviously you can play around with that. Here is the ggplot code, but I show an interactive result via highcharter.


# reduce to just the 20 countries with the most cases
top_20 = world_totals  %>% 
  ungroup() %>% 
  top_n(20, confirmed) %>% 
  arrange(desc(confirmed))
 
# this is to create a ruled effect, not necessary
plot_data = countries %>%
  filter(country_region %in% top_20$country_region) %>%
  group_by(country_region) %>% 
  mutate(
    daily_confirmed = confirmed - lag(confirmed),
    daily_confirmed = if_else(is.na(daily_confirmed), confirmed, daily_confirmed)
  ) %>% 
  ungroup() %>% 
  mutate(
    country_region   = ordered(country_region, levels = rev(top_20$country_region)),
    line_positions = as.numeric(country_region) + .5,
    line_positions = ifelse(line_positions == max(line_positions), NA, line_positions)
  ) 
  
plot_data %>% 
  ggplot(aes(x = date, y = country_region)) +
  geom_tile(
    aes(
      fill   = daily_confirmed,
      width  = .9,
      height = 0.5
    ),
    na.rm = T,
    size  = 2
  ) +
  geom_hline(
    aes(yintercept = line_positions),
    color = 'gray92',
    size  = .25
  ) +
  scale_fill_scico(
    end = .75,
    na.value = 'gray98',
    palette  = 'lajolla',
    trans    = 'log',
    breaks   = c(5, 25, 100, 500, 2500)
  ) +
  labs(x = '', y = '') +
  guides(fill = guide_legend(title = 'New Cases')) +
  visibly::theme_clean() +
  theme(
    axis.ticks.y = element_blank(),
    legend.text  = element_text(size = 6),
    legend.title = element_text(size = 10)
  )

The following shows the percentage increase in cases over the previous day (capped at 100 or greater). Early on this isn’t very useful (e.g. going from 5 to 10 is a 100% increase). We can see that China and South Korea have minimal to no increase day to day at present, while other countries are continuing to see relatively large increases.

The following shows the number of confirmed cases per 10000 people using the Open Covid data, which includes population. However, this obviously doesn’t account for the lack of testing, country ineptitude in counting, country discretion in counting, and many other things. To keep things clean, we focus on the ten countries with the most cases.


countries_oc = read_open_covid_data() %>% 
  filter(is.na(region_code))

State Totals

Here is a data table for easy look-up of current U.S. It is based on the New York Times database.

We can visualize the state level data in numerous ways. Here is the world trend plot previously seen, now applied just to the U.S. I used plotly to make it interactive.


highlight = us_current %>% 
  top_n(10, cases) %>% 
  pull(state_abb)

us_total = countries %>% 
  filter(country_region == 'US') %>% 
  rename(cases = confirmed)

p = us %>% 
  ggplot(aes(x = date, y = cases)) +
  geom_path(aes(group = state), alpha = .01) +
  geom_point(
    aes(),
    size  = 6,
    alpha = .1,
    data  = us_total
  ) +
  geom_point(
    aes(color = state),
    size  = .5,
    alpha = .5,
    data  = filter(us, state_abb %in% highlight)
  ) +
  geom_text_repel(
    aes(label = state_abb, color = state),
    size  = 2,
    alpha = .85,
    data  = filter(us_current, state_abb %in% highlight)
  ) +
  scale_color_scico_d(begin = .1, end = .9) +
  scale_x_date(date_breaks = '2 weeks') +
  scale_y_continuous(
    position = "right",
    trans    = 'log',
    breaks   = c(50, unlist(map(c(1,5), function(x) x*10^(2:6)))), 
    labels   = scales::comma
    ) +
  visibly::theme_clean() + 
  labs(
    x = '', 
    y = '',
    subtitle =  'Total Confirmed Cases',
    caption = 'Dark large dot is U.S. total'
    ) +
  theme(
    axis.text.x  = element_text(size = 6),
    axis.text.y  = element_text(size = 6),
    axis.title.y = element_text(size = 6),
    axis.ticks.y = element_blank(),
    legend.title       = element_blank(),
    legend.key.size    = unit(.25, 'cm'),
    legend.text        = element_text(size = 6),
    legend.box.spacing = unit(0, 'mm'),
    legend.box.margin  = margin(0),
    legend.position = 'left',
    title = element_text(size = 12)
  )

State Bins

Here we look at counts and death rates via a ‘binned’ map. There are numerous issues with trying to depict numeric information on a map. This at least tries to solve the issue of state size by making them all equal while retaining the basic shape of the U.S.


library(statebins)

us_current %>% 
  filter(state != 'District of Columbia') %>%
  statebins(
    state_col = 'state',
    value_col = "log(cases)",
    palette   = "OrRd", 
    direction = 1,
    name = "Covid Counts (log)"
  ) +
  theme_statebins(base_size = 8)


us_current %>% 
  mutate(death_rate = deaths/cases) %>% 
  statebins(
    state_col = 'state',
    value_col = "death_rate",
    palette   = "OrRd", 
    direction = 1,
    name = "Death Rate"
  ) +
  theme_statebins(base_size = 8)

Geofacet

We can plot trends in a map-like fashion too, and the following uses the geofacet package. In early April, this is one of the scarier graphics, with no flattening in sight at the state level. By June, you can see the upticks in various states due to repopening.


library(geofacet)

us %>% 
  filter(cases != 0, state != 'District of Columbia') %>%
  ggplot(aes(date, cases, group = state)) +
  geom_path(color = '#ff550080') +
  labs(y = '', x = '') +
  facet_geo(~state, scales = 'free') +
  visibly::theme_clean() +
  theme(
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y  = element_text(size = 4),
    strip.text   = element_text(size = 4),
  )

Daily counts

We can do the daily counts as before.


levs = us_current %>% 
  arrange(cases) %>% 
  pull(state)

plot_data = us %>%
  mutate(
    state    = ordered(state, levels = levs),
    line_positions = as.numeric(state) + .5,
    line_positions = ifelse(line_positions == max(line_positions), NA, line_positions)
  ) 
  
plot_data %>% 
  ggplot(aes(x = date, y = state)) +
  geom_tile(
    aes(
      fill   = daily_cases,
      width  = .9,
      height = 0.5
    ),
    na.rm = T,
    size  = 2
  ) +
  geom_hline(
    aes(yintercept = line_positions),
    color = 'gray92',
    size  = .25
  ) +
  scale_fill_scico(
    end = .75,
    na.value = 'gray98',
    palette  = 'lajolla',
    trans    = 'log',
    breaks   = c(5, 25, 100, 500, 2500)
  ) +
  labs(x = '', y = '') +
  guides(fill = guide_legend(title = 'New Cases')) +
  visibly::theme_clean() +
  theme(
    axis.text.y  = element_text(size = 6),
    axis.ticks.y = element_blank(),
    legend.text  = element_text(size = 6),
    legend.title = element_text(size = 10)
  )

The following shows the number of confirmed cases per 10000 people. Top 10 states are highlighted.

Since states began opening in May we can look at what’s happened from that point on. Highlighted are the 10 states with the largest and smallest percentage increase since the beginning of May.

Let’s start to look at more specific results, such as county level counts.

Michigan

The following creates an animated plot of Michigan counties over time.


library(sf)

mi_counties = county_trends %>% 
  filter(state == 'Michigan')

all_counties_sf = st_as_sf(maps::map("county", plot = FALSE, fill = TRUE))

prep_counties = mi_counties %>% 
  mutate(county = str_remove(tolower(county), '[[:punct:]]'), 
         state = tolower(state)) %>% 
  droplevels() %>% 
  arrange(county)

mi_counties_sf = all_counties_sf %>% 
  filter(grepl("michigan", ID)) %>% 
  separate(ID, into = c('state', 'county'), remove = F, sep = ',') %>%
  select(-state) %>%
  left_join(prep_counties) %>% 
  mutate(cases = if_else(is.na(cases), 0, cases))

mi_plot =  mi_counties_sf %>% 
  arrange(date) %>% 
  ggplot(aes(group = date)) +
  geom_sf(aes(fill = cases), color = NA) +
  coord_sf(xlim   = c(-91,-82),
           ylim   = c(41, 48.5),
           expand = FALSE) +
  scale_fill_scico(
    trans    = 'log',
    breaks   = c(1, 10, 50, 250, 1250, 5000),
    palette  = 'lajolla',
    na.value = 'gray95'
  ) +
  guides(fill = guide_legend(title = '')) +
  theme_void() +
  theme(
    legend.key.size = unit(.25, 'cm'),
    legend.text     =  element_text(size = 6),
    legend.title    = element_text(size = 6)
  )

mi_plot

Now let’s show it over time with gganimate.


library(gganimate)

mi_plot_anim = mi_plot +
  labs(title = '{closest_state}') +
  transition_states(date, transition_length = 2) +
  theme(title = element_text(color ='gray50'))

mi_plot_anim

Interactive County Map

I thought I would try an interactive map for Michigan counties using highcharts via the highcharter package. This requires creating a unique county code that is nothing more than a FIPS code (but oddly doesn’t simply use the actual FIPS code).


library(highcharter)

plot_data = mi_counties %>%
  filter(date == max(date)) %>%
  mutate(code = paste0('us-mi-', str_sub(fips, start = 3)))

hcmap(
  "countries/us/us-mi-all",
  data = plot_data,
  name = "Michigan",
  value = "cases",
  joinBy = c("hc-key", "code"),
  borderColor = "#fffff8",
  borderWidth = .1
) %>%
  hc_colorAxis(
    dataClasses = color_classes(
      c(0, 10, 100, 500, 1000, 2000, 5000, 1e4, 2e4, 5e4),
      colors = scico(6, end = .9, palette = 'lajolla')
    )
  ) %>%
  hc_legend(
    layout = "vertical",
    align = "right",
    floating = TRUE,
    valueDecimals = 0#,
    # valueSuffix = "%"
  ) %>% 
  hc_credits(enabled = FALSE)

Washtenaw County

We can also do the trend for Washtenaw county here by just filtering the county trends data. There is a note at the county website:

Most of the diagnoses in the 3/31 spike on the New COVID-19 Cases Reported per Day chart represent a backlog of labs that have been pending for 1-2 weeks. This means most of these individuals got sick and were tested at least a week ago, but are receiving test results now because labs are just now catching up on tests.

Note the error in the NY Times data base starting in June (as of June 10th). It’s not clear why or how they would have gotten this from any county or state source, but they added a couple hundred cases for Washtenaw.

US

For the following map of the whole US, I wanted to do a point plot since counties are arbitrarily shaped, making it easy to think things are going weird in random places. We can use the county centroid locations, make very faint the county lines, and not use a ridiculous point range size to get a better sense of the overall pattern. I’m not used to the sf package, so it took a while to get what I wanted beyond a simple county map. Adding points was not very intuitive, but I got it sorted out in the end, so you get the benefit of the cleaned up code.


library(sf)

prep_counties = county_trends %>% 
  mutate(
    county = str_remove(tolower(county), '[[:punct:]]'),
    state  = tolower(state),
    cases = if_else(is.na(cases), 0, cases)
  ) %>% 
  filter(
    !state %in% c(
      'alaska',
      'hawaii'
    )
  ) 

all_counties_sf2 = all_counties_sf %>% 
  separate(ID, into = c('state', 'county'), remove = F, sep = ',') %>%
  left_join(prep_counties) %>% 
  select(ID, geom, everything())

county_point_plot = 
  ggplot(data = all_counties_sf2) +
  geom_sf(fill = 'gray99', color = 'gray97', size = .1) +
  geom_sf(
    aes(
      color = cases,
      size  = cases
    ),
    data = st_centroid(all_counties_sf2),
    alpha = .05,
    show.legend = F
  ) +
  scale_color_scico(
    trans    = 'log',
    breaks   = c(1, 10, 50, 250, 1250, 5000),
    na.value = 'gray99'
  ) +
  labs(title = 'Confirmed cases') +
  scale_size_continuous(range =  c(.1, 10)) +
  theme_void()  +
  theme(title = element_text(color = 'gray50'))

county_point_plot

What’s interesting is if we do a similar plot, but actually look for more anomalous results, rather than just calling more largely populated areas hotspots. To do this we need either a per capita rate, a percentage change or similar, or even adjusting the size range, while making the assumption that populated areas are already the ones that are going to have the most cases.

I grabbed population values with tidycensus (not shown), then removed the top 20 most populated counties from consideration. You can get the data here.

The visualization shows numerous hotspots across the deep south, and others not considered in the Midwest, e.g. the Indianapolis area, Rocky Mountain areas, and more. It definitely makes the spread more apparent than the simple counts might suggest.

Interactive County Map

If you do want an actual county map, here is an interactive one as before. To make it easier I add state abbreviations to the original data before creating codes for every county. A value is initially missing for Oglala county South Dakota, because highcharts hasn’t updated its map data source in several years, and it has since been renamed from Shannon county. Likewise, in Alaska, highcharts still refers to Wade Hampton rather than Kusilvak.


states = tibble(state = state.name, state_abb = state.abb)

plot_data = county_trends %>%
  filter(date == max(date)) %>% 
  left_join(states) %>% 
  mutate(
    code = paste0('us-', tolower(state_abb), '-', str_sub(fips, start = 3)),
    code = if_else(code == 'us-sd-102', 'us-sd-113', code),  # Oglala
    code = if_else(code == 'us-ak-158', 'us-ak-270', code),  # Kusilvak
  )
  

# plot_data

hcmap(
  "countries/us/us-all-all",
  data = plot_data,
  name = "Confirmed cases:",
  value = "cases",
  joinBy = c("hc-key", "code"),
  borderColor = "#fffff8",
  borderWidth = .1,
  download_map_data = T
) %>%
  hc_colorAxis(
    dataClasses = color_classes(
      c(0, sort(unlist(map(c(1, 5), function(x)  x * 10^(1:5))))),
      colors = scico(6, end = .9, palette = 'lajolla')
    )
  ) %>%
  hc_legend(
    layout = "vertical",
    align = "right",
    floating = TRUE,
    valueDecimals = 0#,
    # valueSuffix = "%"
  ) %>% 
  hc_credits(enabled = FALSE)

Hospital Data

I’ve started playing with some hospital data for a dashboard. The (still pretty messy) source code is available on a link there as well, so feel free to play with it for your own needs.

Models

The following models U.S. state daily trends. First we’ll use data from NY Times, which seems to be cleaner than some other sources, but there are still random issues (e.g. Georgia suddenly having fewer cumulative cases from one day to the next). I add a bit of mild cleanup to provide each state a common starting point at the beginning of March, and do some other things specific to package requirements.


us_for_model = us %>% 
  tidyr::expand(date, state) %>% 
  filter(date >= '2020-02-14') %>% 
  left_join(us) %>% 
  arrange(state, date) %>% 
  group_by(state) %>% 
  fill(fips, state_abb, .direction = 'up')  %>% 
  mutate_at(vars(cases:daily_deaths), function(x) ifelse(is.na(x), 0, x)) %>% 
  ungroup() %>% 
  mutate(
    state_abb = fct_explicit_na(state_abb, na_level = 'DC'),
    date_num  = as.numeric(date)
    ) %>% 
  filter(daily_cases >= 0)  # because some states evidently can't count

# us_for_model %>% filter(state == 'Michigan')  # check

To explore the trends, I will use a generalized additive model with mgcv, with a separate smooth for each state. It’s not clear to me whether a specific count distribution is really required given the very high number of daily counts, but we’ll go ahead and use the negative binomial. I also use a specific function to speed up processing, but this is not required for those willing to wait. This is a fairly naive model, and, as implemented, has no natural way to project the eventual downward trend to zero. However it should be better than deterministic models for very short-term prediction.


library(mgcv)

model_state = bam(
  daily_cases ~ s(date_num, state_abb, bs = 'fs'),
  family   = nb,
  data     = us_for_model,
  nthreads = 10
)

# summary(model_state)

We can visualize the estimated trends. Looks to be fairly on point.

The following plots the derivatives of each state curve using gratia. Each dot reflects a state’s peak positive rate of change. Most peaks are past at this point. You can double click on a state in the legend to focus on that state.

The following predicts the next 4 days for the ten states with the most confirmed cases.

A Bayesian SIR Model for Washtenaw County

This model is based on the article and code from:

Contemporary statistical inference for infectious disease models using Stan. Chatzilenaa, van Leeuwen,Ratmann, Baguelin, & Demiris (2019) (Chatzilena et al. 2019)

The following is only a very slightly revised version of the code they made available in GitHub. In essence it is a standard SIR epidemiological model, put into a probabilistic framework, to model infections. They have code for another more complicated model, but I’ve had trouble getting that to converge.


functions {
  real[] SIR(real t,  // time
             real[] y,            // system state {susceptible,infected,recovered}
             real[] theta,        // parameters
             real[] x_r,
             int[]  x_i) {
    
    real dy_dt[3];
    
    dy_dt[1] = -theta[1] * y[1] * y[2];                     // S
    dy_dt[2] =  theta[1] * y[1] * y[2] - theta[2] * y[2];   // I
    dy_dt[3] =                           theta[2] * y[2];   // R
    
    return dy_dt;
  }
}

data {
  int<lower = 1> n_obs;           // number of days observed
  int<lower = 1> n_theta;         // number of model parameters
  int<lower = 1> n_difeq;         // number of differential equations
  int<lower = 1> n_pop;           // population 
  int y[n_obs];                   // data, total number of infected individuals each day
  real t0;                        // initial time point (zero)
  real ts[n_obs];                 // time points observed
}

transformed data {
  real x_r[0];
  int x_i[0];
}

parameters {
  real<lower = 0> theta[n_theta]; // model parameters {beta, gamma}
  real<lower = 0, upper = 1> S0;  // initial fraction of susceptible individuals
}

transformed parameters{
  real y_hat[n_obs, n_difeq];     // solution from the ODE solver
  real y_init[n_difeq];           // initial conditions for both fractions of S and I
  
  y_init[1] = S0;
  y_init[2] = 1 - S0;
  y_init[3] = 0;
  
  y_hat = integrate_ode_rk45(SIR, y_init, t0, ts, theta, x_r, x_i);
}

model {
  real lambda[n_obs];             // poisson parameter
  
  //priors
  theta[1] ~ lognormal(0,1);
  theta[2] ~ gamma(0.004, 0.02);  // Assume mean infectious period = 5 days 
  S0 ~ beta(99, 1);
  
  //likelihood
  for (i in 1:n_obs) {
    lambda[i] = y_hat[i, 2] * n_pop;
  }
  y ~ poisson(lambda);
}

generated quantities {
  real R_0;                       // Basic reproduction number
  
  R_0 = theta[1]/theta[2];

Here are the data steps. For ease we grab county info from the NY Times database. Beyond that we set things up for the Stan.


washtenaw <- readr::read_csv(
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
) %>%
  filter(county == 'Washtenaw')

washtenaw = washtenaw %>% 
  mutate(day = 0:(n()-1)) %>% 
  mutate(infected = cases - lag(cases),
         infected = ifelse(is.na(infected), 0, infected))

N = nrow(washtenaw)    # Number of days observed throughout the outbreak
pop = 370963           # Population (from Census/Wikipedia)
sample_time = 1:N

# Specify parameters to monitor
parameters = c("y_hat", "y_init", "theta",  "R_0")  # SIR deterministic model parameters

# Modify data into a form suitable for Stan
covid_data = list(
  n_obs   = N,
  n_theta = 2,
  n_difeq = 3,
  n_pop   = pop,
  y       = washtenaw$infected,
  t0      = 0,
  ts      = sample_time
)


# Set initial values:
set.seed(123)

initial_values = function() {
  list(
    theta = c(runif(1, .5, 3), runif(1, 0.2, 0.4)),
    S0    = runif(1, (pop - 3) / pop, (pop - 1) / pop)
    )
}

library(rstan)

sir_model_fit = 
  stan(
    model_code = sir_model_code,
    data       = covid_data,
    pars       = parameters,
    init       = initial_values,
    cores      = 4,
    warmup     = 2000,
    iter       = 4000,
    thin       = 8,
    seed       = 123,
    control    = list(
      adapt_delta   = .99,
      max_treedepth = 15
    )
)

The current \(R_0\) produced by the model is 1.01, which would suggest we’re getting close to a good state in this county (negative \(R_0\) would suggest the infection is no longer spreading). However, given that there is no vaccine, almost the entire population is still at risk and susceptible to the disease.

Now we are ready to plot the result. The following uses tidybayes package for the visualization of the predictions. We can see that even adding a probabilistic approach to this deterministic model, it misses the mark, and notably undercounted some dates2. But in the end it does not miss it by too much except in one crucial aspect- determining the drop off.


library(tidybayes)

sir_y_hat = gather_draws(sir_model_fit,  y_hat[day, type]) %>% 
  ungroup() %>% 
  mutate(type = factor(type, labels = c('S', 'I', 'R')))

samples_infected = sir_y_hat %>% 
  filter(type == 'I') %>% 
  select(day, type, .draw, .value) %>% 
  left_join(washtenaw %>% select(date, day)) %>% 
  rename(prop_infected = .value)  

bilbao = scico::scico(10)

samples_infected %>% 
  mutate(pred_infected = prop_infected*pop) %>% 
  ggplot(aes(x = date, y = pred_infected)) +
  stat_lineribbon(
    aes(y = pred_infected),
    color = bilbao[10],
    size = 1,
    .width = c(.99, .95, .8, .5),
    alpha = 0.25
  ) +
  geom_point(aes(y = infected), color = bilbao[10], alpha=.5, data = washtenaw) +
  guides(x = guide_axis(n.dodge = 2)) +
  labs(x = '', y = '', title = 'Bayesian SIR model for daily infected counts', subtitle = 'Washtenaw County') +
  scico::scale_fill_scico_d(begin = .25, palette = "bilbao") +
  scale_x_date(date_breaks = '3 days', date_labels = '%b-%d') +
  visibly::theme_clean()

We can do a posterior predictive check of our expected vs. observed counts. Not perfect, and definitely could be better. It’s actually been getting worse as time has progressed, and the counts and reporting times get more erratic.

Chatzilena, Anastasia, Edwin [van Leeuwen], Oliver Ratmann, Marc Baguelin, and Nikolaos Demiris. 2019. “Contemporary Statistical Inference for Infectious Disease Models Using Stan.” Epidemics 29: 100367. https://doi.org/https://doi.org/10.1016/j.epidem.2019.100367.


  1. I use Distill for this website, and strangely it does not have basic code folding.↩︎

  2. There was a backlog of tests that came through on April 1st, so I don’t fault the model for missing that outlier.↩︎

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com//m-clark/m-clark.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Clark (2020, June 10). Michael Clark: Exploring the Pandemic. Retrieved from https://m-clark.github.io/posts/2020-03-23-covid/

BibTeX citation

@misc{clark2020exploring,
  author = {Clark, Michael},
  title = {Michael Clark: Exploring the Pandemic},
  url = {https://m-clark.github.io/posts/2020-03-23-covid/},
  year = {2020}
}