About the Model

Our COVID-19 prediction model adds the power of artificial intelligence on top of a classic infectious disease model. We developed a simulator based on the SEIR model to simulate the COVID-19 epidemic in each region. The parameters/inputs of this simulator are then learned using machine learning techniques that attempt to minimize the error between the projected outputs and the actual results. We use the daily deaths data reported by each region to forecast future reported deaths. After some additional validation techniques (to minimize a phenomenon called overfitting), we use the learned parameters to simulate the future and make projections.
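The simulate-then-fit loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the function names, the population size, the incubation and infectious periods, and the use of a simple grid search in place of the actual learning procedure are all hypothetical choices made for the example.

```python
def simulate_seir(beta, days, population=1_000_000, initial_infected=10,
                  incubation_days=5.0, infectious_days=7.0, ifr=0.01):
    """Run a discrete-time SEIR model and return expected deaths per day.

    All default parameter values are illustrative assumptions.
    """
    sigma = 1.0 / incubation_days   # rate of moving from Exposed to Infectious
    gamma = 1.0 / infectious_days   # rate of moving from Infectious to Removed
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    daily_deaths = []
    for _ in range(days):
        new_exposed = beta * s * i / population
        new_infectious = sigma * e
        new_removed = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        # A fixed share of removed individuals is counted as deaths.
        daily_deaths.append(ifr * new_removed)
    return daily_deaths


def mean_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def fit_beta(observed_deaths, candidates):
    """Grid search: pick the transmission rate that best reproduces the data."""
    return min(
        candidates,
        key=lambda b: mean_squared_error(
            simulate_seir(b, len(observed_deaths)), observed_deaths
        ),
    )


# Synthetic "reported" deaths generated with a known transmission rate.
observed = simulate_seir(0.40, days=60)
candidates = [round(0.20 + 0.02 * k, 2) for k in range(21)]  # 0.20 .. 0.60
best_beta = fit_beta(observed, candidates)

# Use the learned parameter to project 30 days past the observed window.
forecast = simulate_seir(best_beta, days=90)[60:]
```

Because the synthetic data were generated with a transmission rate that is on the candidate grid, the search recovers it exactly; with real reported deaths the fit is only approximate, which is where the validation step against overfitting comes in.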

The goal of this project is to showcase the strengths of artificial intelligence in tackling one of the world’s most difficult problems: predicting the course of a pandemic. Here, we use a purely data-driven approach by letting the machine do the learning.

We are currently making projections for: the United States, all 50 US states (plus DC, PR, VI, Guam) and 63 countries (including all 27 EU countries). Combined, these 64 countries account for 99% of all global COVID-19 deaths.

See an analysis of our model by Dr. Carl T. Bergstrom, Professor of Biology at the University of Washington.

Click here to read a more in-depth description of how our model operates.

How Our Model is Different

Historical Performance

Last Updated: Jun 1

A model isn’t very useful if it’s not accurate. Below is our analysis of how various models considered by the CDC have performed over the past few weeks. Because the CDC receives new projections every Monday, we use projections from past Mondays to evaluate the models.

Click here to see performance evaluations for past dates.

May 30 evaluation of state-by-state projections

States comparison

May 30 evaluation of US projections

US comparison

Notes

Projections taken from: https://github.com/reichlab/covid19-forecast-hub

Truth data from Johns Hopkins: https://github.com/CSSEGISandData/COVID-19

Concerns with the IHME model

In this section we will compare our projections with a popular model developed by the Institute for Health Metrics and Evaluation (IHME) and commonly referred to by the White House and media. Below, we compare a sample of our past projections (C19Pro) with IHME for US, New York, Michigan, New Jersey, California and Italy, some of the most heavily impacted regions.

As you can see from the plots above, IHME’s projections failed to accurately capture the true trajectory for these regions. Our projections, meanwhile, have been significantly more accurate. Below, we will go into further details as to why IHME is a flawed model.

News articles from Vox, STAT News, CNN, and Quartz echo our concerns.

In the words of Ruth Etzioni, an epidemiologist at Seattle’s Fred Hutchinson Cancer Research Center, “that [the IHME model] is being used for policy decisions and its results interpreted wrongly is a travesty unfolding before our eyes.”

May 4 Revision

On May 4, IHME completely overhauled their previous model and increased their projections from 72k to 132k US deaths by August. Whereas they were previously underprojecting, they are now overprojecting the month of May. At the time of their new update on May 4, there were 68,919 deaths in the US. They projected that there would be 17,201 deaths in the week ending on May 11. In fact, there were only 11,757 deaths. Relative to the actual count, IHME overshot their 1-week projection by about 46%. Meanwhile, we projected 10,676 deaths from May 4 through May 11, an error of less than 10%.
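For readers who want to check the one-week comparison, here is a minimal sketch of the arithmetic, assuming the error is measured relative to the actual death count (the helper name `percent_error` is ours for illustration):

```python
def percent_error(projected, actual):
    """Signed error of a projection relative to the actual count, in percent."""
    return 100.0 * (projected - actual) / actual


actual_deaths = 11_757  # US deaths reported from May 4 through May 11

ihme_error = percent_error(17_201, actual_deaths)    # positive: overprojection
c19pro_error = percent_error(10_676, actual_deaths)  # negative: underprojection

print(f"IHME: {ihme_error:+.1f}%  C19Pro: {c19pro_error:+.1f}%")
```

A positive value means the projection was too high and a negative value means it was too low.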

IHME went from severely underprojecting to overprojecting, as you can see in the comparison of May 4 projections below. Furthermore, as recently as May 12, they were still projecting 0 deaths by August 4. Their model should not be relied on for accurate projections.

Sample Summary of IHME Inaccurate Predictions

In their April 15 projections, the death totals that IHME projected would take four months to reach were in fact exceeded in six days:

| Region | April 21 total deaths | IHME Aug proj. deaths from Apr 15 | Our Aug proj. deaths from Apr 15 |
|---|---|---|---|
| New York | 19,104 | 14,542 | 33,384 |
| New Jersey | 4,753 | 4,407 | 12,056 |
| Michigan | 2,575 | 2,373 | 8,196 |
| Illinois | 1,468 | 1,248 | 4,163 |
| Italy | 24,648 | 21,130 | 40,216 |
| Spain | 21,282 | 18,713 | 31,854 |
| France | 20,829 | 17,448 | 41,643 |

As you can see above, their models made misguided projections for almost all of the worst impacted regions in the world. The most alarming thing is that they continued to make low projections. Below are their projections from April 21. All of the projections below were exceeded by May 2, a mere 11 days later:

| Region | May 2 total deaths | IHME Aug proj. deaths from Apr 21 | Our Aug proj. deaths from Apr 21 |
|---|---|---|---|
| New York | 24,198 | 23,741 | 35,238 |
| New Jersey | 7,742 | 7,116 | 13,651 |
| Michigan | 4,021 | 3,361 | 6,798 |
| Illinois | 2,559 | 2,093 | 6,653 |
| Italy | 28,710 | 26,600 | 44,683 |
| Spain | 25,100 | 24,624 | 31,854 |
| France | 24,763 | 23,104 | 41,643 |

As scientists, we update our models as new data becomes available. Models are going to make wrong predictions, but it’s important that we correct them as soon as new data shows otherwise. The problem with IHME is that they refused to recognize and update their wrong assumptions for many weeks. Throughout April, millions of Americans were falsely led to believe that the epidemic would be over by June because of IHME’s projections.

On April 30, the director of the IHME, Dr. Chris Murray, appeared on CNN and continued to advocate their model’s 72,000 deaths projection by August. On that day, the US reported 63,000 deaths, with 13,000 deaths coming from the previous week alone. Four days later, IHME nearly doubled their projections to 135,000 deaths by August. One week after Dr. Murray’s CNN appearance, the US surpassed his 72,000 deaths by August estimate. It seems like an ill-advised decision to go on national television and proclaim 72,000 deaths by August only to double the projections a mere four days later.

Unfortunately, by the time IHME revised their projections in May, millions of Americans had already heard their 60,000-70,000 estimate. It may take a while to undo that misconception and undo the policies that were put in place as a result of this misleading estimate.

US June-August

As of April 11, IHME projected 225 (0 - 1,180) deaths in the US from June 1 to August 4. While we hope the US only has 225 total deaths from June to August (an average of 3 deaths per day), we believe this is an underestimate.

Update time

New data is extremely important when making projections such as these. That’s why we update our model daily based on the new data we receive. Projections using today’s data are much more valuable than projections from 2-3 days ago. However, due to certain constraints, IHME is only able to update their model 1-2 times a week: “Our ambition to produce daily updates has proven to be unrealistic given the relative size of our team and the effort required to fully process, review, and vet large amounts of data alongside implementing model updates.”

Mobility Data

On April 17, IHME stated that they are incorporating new cell phone mobility data which indicate that people have been properly practicing social distancing: “These data suggest that mobility and presumably social contact have declined in certain states earlier than the organization’s modeling predicted, especially in the South.” As a result, IHME lowered their projections from 68k deaths to 60k deaths by August. Their critical flaw is that they assume a linear relationship between lower mobility and lower infection - this is not the case.

Most transmissions do not happen with strangers, but rather close contacts. Even if you reduce your mobility by 90%, you do not reduce your transmission by 90%. The data from Italy shows that it only reduces by around 60%. That’s the difference between 20k and 40k+ deaths. IHME was likely making the wrong assumption that a 90% reduction in mobility will decrease transmission by 90%. Here is a compilation from infectious disease expert Dr. Muge Cevik showing that household contacts were the most likely to be infected.
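To make the point concrete, here is a toy decomposition (not IHME’s method and not part of our model). It assumes transmission splits into a close-contact share that is unaffected by mobility and a community share that scales with it. The one-third close-contact share is a hypothetical value chosen so that a 90% mobility drop yields the roughly 60% transmission drop cited above.

```python
def transmission_reduction(mobility_reduction, close_contact_share):
    """Fractional drop in transmission when only community contacts scale with mobility.

    close_contact_share is the (assumed) fraction of transmission that happens
    among household and other close contacts, which lockdowns do not remove.
    """
    community_share = 1.0 - close_contact_share
    return community_share * mobility_reduction


# Hypothetical split: one third of transmission occurs among close contacts.
reduction = transmission_reduction(0.90, close_contact_share=1 / 3)
print(f"Transmission falls by about {reduction:.0%}")
```

Under this sketch, assuming a one-to-one relationship is equivalent to setting the close-contact share to zero, which overstates the effect of reduced mobility.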

We posted a Tweet on April 11 about MTA (NYC) and BART (Bay Area) subway ridership being down 90% in March. However, deaths have only dropped around 25% in NY, while CA has yet to see a sharp decrease in deaths in April, more than a month after the drop in ridership.

Interestingly, after IHME suddenly revised their projections from 72k to 130k on May 4, the director of IHME offered this explanation for why they raised their estimates: “…we’re seeing just explosive increases in mobility in a number of states that we expect will translate into more cases and deaths.” This directly contradicts their press release just 2 weeks earlier stating that mobility had been lower than predicted. A 2-week difference in mobility should not explain this sudden jump in projections - only a flawed methodology would.

State Reopening Timeline

In their April 17 press release, IHME released estimates of when they believe each state will have a prevalence of fewer than 1 case per 1 million. They noted that 35 states will reach under 1 prevalent infection per 1 million before June 8, and that “states such as Louisiana, Michigan, and Washington, may fall below the 1 prevalent infection per 1,000,000 threshold around mid-May.”

As of May 15, Louisiana, Michigan, and Washington are reporting 30-90 confirmed cases per million each day. Furthermore, prevalent infections are 5-15x higher than reported cases since most cases are mild and thus not tested/reported. As a result, we estimate Louisiana and Michigan to have around 7,000 prevalent infections per million, which is 7,000 times higher than IHME’s April 17 estimates. An analysis of many of the remaining states shows a similarly high degree of error. Hence, IHME’s estimates have been off by more than 3 orders of magnitude.
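A rough back-of-the-envelope version of this estimate can be sketched as follows. It assumes steady daily case counts and the 15-day average infection duration stated in our Infections Estimate section; the function name is ours, and the calculation is a simplification rather than the model’s actual procedure.

```python
import math


def prevalent_infections_per_million(daily_cases_per_million,
                                     underreporting_factor,
                                     infectious_days=15):
    """Approximate prevalence, assuming constant daily cases over the infectious period."""
    return daily_cases_per_million * underreporting_factor * infectious_days


low = prevalent_infections_per_million(30, 5)    # low end of both ranges
high = prevalent_infections_per_million(90, 15)  # high end of both ranges

# Ratio between our estimate (about 7,000 per million) and the threshold of
# 1 per million, expressed in orders of magnitude.
orders_of_magnitude = math.log10(7_000 / 1)

print(f"Plausible range: {low:,} to {high:,} prevalent infections per million")
print(f"Gap versus 1 per million: {orders_of_magnitude:.1f} orders of magnitude")
```

The estimate of around 7,000 per million falls inside the range this sketch produces.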

Unfortunately, it is likely that many individuals and policy-makers used IHME’s misguided reopening timelines to shape their reopening decisions. These timelines were picked up and widely disseminated by many media outlets, both local and national. Any policies guided by these estimates can have repercussions weeks and months down the road.

Technical Flaw

May 4 Update: IHME completely overhauled their previous model to now use an SEIR model. Our model is based on SEIR, and that has not changed since we first began making projections on April 1.

On top of everything we mentioned above, their model is also inherently flawed from a mathematical perspective. They try to model COVID-19 infections using a Gaussian error function. The problem is that the Gaussian error function is by design symmetric, meaning that the curve comes down from the peak at the same rate as it goes up. Unfortunately, this has not been the case for COVID-19: we come down from the peak at a much slower pace. This leads to a significant under-projection in IHME’s model, which we have thoroughly highlighted. University of Washington biology professor Dr. Carl T. Bergstrom discussed this in more detail in this highly informative series of Tweets.
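The symmetry can be seen directly in a small sketch. It assumes cumulative deaths follow a scaled error function, so that daily deaths (its derivative) form a bell curve around the peak. The parameter names and values below are illustrative, not IHME’s fitted values.

```python
import math


def cumulative_deaths(t, total, growth, peak_day):
    """Cumulative deaths modeled as a scaled Gaussian error function."""
    return 0.5 * total * (1.0 + math.erf(growth * (t - peak_day)))


def daily_deaths(t, total, growth, peak_day):
    """Derivative of the curve above: a Gaussian bell centered on peak_day."""
    z = growth * (t - peak_day)
    return total * growth / math.sqrt(math.pi) * math.exp(-z * z)


# Illustrative parameters (assumptions for this example only).
total, growth, peak_day = 60_000, 0.08, 40

for offset in (5, 10, 20):
    before = daily_deaths(peak_day - offset, total, growth, peak_day)
    after = daily_deaths(peak_day + offset, total, growth, peak_day)
    # The decline after the peak mirrors the rise before it.
    print(f"{offset:>2} days from peak: {before:8.1f} before, {after:8.1f} after")
```

Whatever parameters are chosen, a curve of this form forces deaths to fall as quickly as they rose, so it cannot represent a slow decline from the peak.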

Click here to see how our projections have changed over time, compared with the IHME model. For a comparison of April projections for several heavily-impacted states and countries, click here.

To conclude, we believe that a successful model must be able to quickly determine what is realistic and what is not, and the above examples highlight our main concerns with the IHME model.

Data and Output

To make our projections, we use the daily death total provided by Johns Hopkins CSSE, which experts consider the “gold standard” reference data. We do not use case-related data in our modeling due to reasoning alluded to here.

Every day, raw daily projections for all 50 US states and select international countries will be uploaded onto our GitHub page. We are projecting future deaths as reported by Johns Hopkins CSSE. For the US, this includes both confirmed and probable deaths.

Assumptions

Social Distancing

Heavy vs moderate social distancing

Heavy social distancing is what many states and countries enacted in the initial stages of the epidemic: stay-at-home orders, closed non-essential businesses, etc. Infection rates typically decrease ~60%, going from an R0 of around 2-3 to an R of 0.6-1.0. As long as R, a measure of how many people an infected person infects on average, is less than 1, infections will decrease over time. If R is greater than 1, then the infection curve will rise. Hence, the ultimate goal is to keep R under 1.
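A minimal sketch of why the threshold of 1 matters, assuming each generation of infections is simply R times the previous one (this ignores immunity and timing effects that the full SEIR model accounts for; the starting count is arbitrary):

```python
def infections_after(generations, r, initial=1000.0):
    """New infections per generation if each case infects r others on average."""
    return initial * r ** generations


# Starting from 1,000 infections and following 10 generations of spread:
shrinking = infections_after(10, r=0.8)  # about 107, the outbreak fades
steady = infections_after(10, r=1.0)     # stays at 1,000
growing = infections_after(10, r=1.2)    # about 6,192, the outbreak accelerates

print(f"R=0.8: {shrinking:,.0f}  R=1.0: {steady:,.0f}  R=1.2: {growing:,.0f}")
```

Even a small move across the threshold, from 0.8 to 1.2, turns a fading outbreak into a growing one.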

Moderate social distancing is what we assume will happen once states and countries gradually begin relaxing their social distancing guidelines. Some establishments will reopen, but people will still be somewhat cognizant of maintaining social distancing. Most states and countries will have guidelines that aim to maximize social distancing and minimize close contact, such as enforcing capacity limits and recommending mask-wearing. We assume that infection rates will increase 0-20%, resulting in an R of around 0.8-1.2. This is based on analysis of R values in regions where there were no lockdowns, such as Sweden and South Dakota. Note that this is still a lower infection rate than what it was prior to the outbreak for most regions.

If regions impose stricter social distancing guidelines than our assumptions listed above, then we will likely see lower infection and death rates than the current projections. Conversely, if regions impose looser guidelines, then we will likely see higher infection and death rates. For example, if California reopens before June 1, there will be an increased chance of an earlier resurgence. Or if a state requires all residents to wear masks, the likelihood of a steep increase in infections will decrease, according to some recent studies ([1], [2], [3]).

Second wave

In regions where the outbreak has not yet been fully contained, it is possible that reopening will cause a second wave of infections if states fail to maintain sufficient social distancing.

Second lockdown

We assume that states with a second outbreak will take actions to reduce transmission, such as increased contact tracing, mandatory mask wearing, improved treatment, etc. In the case where the infection curve continues to rise exponentially after a reopening, it may become necessary for regions to impose additional mitigation measures, perhaps even a second lockdown. A second lockdown was seen in numerous Asian countries where a second wave occurred, including Japan, Hong Kong, and Singapore. Our model incorporates the concept of a second lockdown, which we estimate will happen approximately 30 days after the reopening. Additional mitigation strategies are only necessary if the effective reproduction number (R) after reopening is significantly greater than 1.
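One way to picture this assumption is as a piecewise schedule for R. The sketch below is a hypothetical illustration: the R values, the threshold of 1.0, and the function name are placeholders chosen for the example, and only the 30-day delay comes from the description above.

```python
def effective_r(day, reopen_day, lockdown_r=0.8, reopen_r=1.2,
                mitigated_r=0.9, delay_days=30, threshold=1.0):
    """Return the assumed R on a given day under a reopening scenario.

    Before reopening, R stays at the lockdown level. After reopening it rises
    to reopen_r. If reopen_r is above the threshold, additional mitigation is
    assumed to bring R down to mitigated_r once delay_days have passed.
    """
    if day < reopen_day:
        return lockdown_r
    if reopen_r > threshold and day >= reopen_day + delay_days:
        return mitigated_r
    return reopen_r


# Reopening on day 20 with R rising above 1: mitigation starts on day 50.
schedule = [effective_r(day, reopen_day=20) for day in (10, 25, 50)]
print(schedule)  # [0.8, 1.2, 0.9]

# If R stays below the threshold after reopening, nothing further is triggered.
print(effective_r(50, reopen_day=20, reopen_r=0.95))  # 0.95
```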

Infections Estimate

The current and total infections estimates in our projections are at the core of our SEIR model. We use those estimates to make forecasts regarding future deaths according to the specifications of the SEIR model. The total infections estimate includes all individuals who have ever been infected by the virus, including asymptomatic individuals as well as those who were never tested. The current infections estimate is based on how many people are currently infected at that time point (total - recovered). To compute current infections, we assume that individuals are infected for an average of 15 days. We estimate that the true number of total infections is likely 5-15x higher than reported cases for most regions.
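The relationship between total, current, and recovered infections can be sketched as a rolling window. This simplification assumes every individual remains infected for exactly 15 days rather than 15 days on average, and the function name and input data are hypothetical.

```python
def current_infections(new_infections_per_day, infectious_days=15):
    """Count people still infected each day, assuming a fixed infection duration."""
    current = []
    for day in range(len(new_infections_per_day)):
        window_start = max(0, day - infectious_days + 1)
        current.append(sum(new_infections_per_day[window_start:day + 1]))
    return current


# Hypothetical input: 100 new infections every day for 20 days.
new_infections = [100] * 20
current = current_infections(new_infections)

total = sum(new_infections)      # everyone ever infected
recovered = total - current[-1]  # total minus currently infected

print(f"total={total}, current={current[-1]}, recovered={recovered}")
```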

Infection Fatality Rate (IFR)

Jun 1 Update: Given new data, we have changed our model to use a variable IFR that decreases over time to reflect improving treatments and the lower proportion of care home deaths. We decrease the IFR linearly over the span of 3 months until it is 75% of the original IFR.

We estimate that the true mortality rate (IFR) for COVID-19 in the US is between 0.9% and 1.2%. This matches a May 7 study that estimates the IFR to be slightly less than 1.3% after accounting for asymptomatic cases. We also found that most countries in Europe (with the exceptions of the United Kingdom, Spain, and Eastern Europe) have an IFR closer to 0.75%, which matches this May 6 study. Hence, in our projections, we use 0.75% for those European countries and 1% for all US states and other countries.
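The variable IFR from the Jun 1 update can be sketched as a simple linear ramp. This assumes that 3 months corresponds to 90 days and that the IFR holds steady after the ramp ends. The function name is ours, and the 1% starting value is the one we use for US states.

```python
def ifr_on_day(day, initial_ifr=0.01, final_fraction=0.75, ramp_days=90):
    """IFR that declines linearly to final_fraction of its initial value."""
    # Clamp so the IFR is constant before day 0 and after the ramp ends.
    progress = min(max(day, 0), ramp_days) / ramp_days
    return initial_ifr * (1.0 - (1.0 - final_fraction) * progress)


for day in (0, 45, 90, 200):
    print(f"day {day:>3}: IFR = {ifr_on_day(day):.4%}")
```

With a 1% starting IFR, the value falls to 0.875% halfway through the ramp and settles at 0.75% from day 90 onward.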

Recent global and US studies point to a 1% IFR as a reasonable estimate.

Limitations

We want to be as clear as possible regarding what our model can and cannot do. While we try our best to make accurate projections, no model is perfect. The future is not set in stone: a single policy change or a small change in the assumptions can have a large impact on how the epidemic progresses.

That’s why in addition to our most likely estimate, we also provide a 95% confidence interval to reflect this uncertainty. For example, if we predict 150,760 deaths with a range of 88-294k, it means that there is roughly a 95% chance that the true deaths will be between 88-294k. Note that these confidence intervals are generated given that our above assumptions hold true. There are many real-world variables that can cause our assumptions to be inaccurate and affect the true outcome. We will try our best to address any inaccurate assumptions as time goes on.

We want to caution against focusing on one particular number as the outcome of this model. We are in fact projecting a range which includes a most likely outcome. If the true results fall within the range, that is within the expected outcome of this model. We highly recommend that you include our range when referencing our projections (e.g. 21,342 (15-34k) deaths).

Additional limitations

While we do our best to ensure accuracy and precision, no model is perfect, so we urge everyone to use caution when interpreting these projections. This is just one particular model, so we encourage everyone to evaluate and be open to multiple sources. At the end of the day, the decision-making rests in the hands of people, not machines.

Historical US Projections

Below, we show how our (C19Pro) August 4 projections for the US have changed over time, compared to IHME. We also show a comparison of the latest projections.

Note that for the entire month of April, IHME projected between 60,000-73,000 deaths by August, all while deaths increased by an average of 2,000 per day. All of their August projections from April were surpassed by May 6.

Also note that while we update our projections daily, IHME only updates their projections once or twice a week.

Online Coverage

Government

Media

Who We Are

covid19-projections.com is made by Youyang Gu, an independent data scientist. Youyang completed his Bachelor’s degree at the Massachusetts Institute of Technology (MIT), double majoring in Electrical Engineering & Computer Science and Mathematics. He also received his Master’s degree at MIT, completing his thesis as part of the Natural Language Processing group at the MIT Computer Science & Artificial Intelligence Laboratory. His expertise is in using machine learning to understand data and make accurate predictions. You can contact him on Twitter or by using the Contact page.

Updates

2020-05-26

2020-05-25

2020-05-19

2020-05-16

2020-05-14

2020-05-13

2020-05-12

2020-05-06

2020-04-30

2020-04-28

2020-04-26

2020-04-24

2020-04-23

2020-04-20

2020-04-15

2020-04-13

2020-04-12

2020-04-09

2020-04-08

2020-04-07

2020-04-05

2020-04-04

2020-04-03

2020-04-02

2020-04-01

2020-03-30