About the Model

Our COVID-19 prediction model adds the power of artificial intelligence on top of a classic infectious disease model. We developed a simulator based on the SEIR model (Wikipedia) to simulate the COVID-19 epidemic in each region. The parameters/inputs of this simulator are then learned using machine learning techniques that attempt to minimize the error between the projected outputs and the actual results. We use the daily deaths data reported by each region to forecast future reported deaths. After some additional validation techniques (to minimize a phenomenon called overfitting), we use the learned parameters to simulate the future and make projections.
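As a rough illustration of this approach (not our actual simulator), the sketch below runs a basic discrete-time SEIR model and grid-searches a single parameter, R0, to minimize the squared error against observed daily deaths. The function names, the fixed IFR, the fixed infection-to-death lag, and the seed infections are simplifying assumptions made for this example only.

```python
def simulate_seir(r0, population, days, incubation_days=5.0,
                  infectious_days=7.0, ifr=0.01, death_lag=14,
                  seed_infections=10.0):
    """Return simulated daily deaths from a basic discrete-time SEIR model."""
    beta = r0 / infectious_days      # transmission rate per infectious person per day
    sigma = 1.0 / incubation_days    # rate of leaving the exposed compartment
    gamma = 1.0 / infectious_days    # rate of leaving the infectious compartment
    s, e, i = population - seed_infections, seed_infections, 0.0
    new_infections = []
    for _ in range(days):
        newly_exposed = beta * s * i / population
        newly_infectious = sigma * e
        newly_removed = gamma * i
        s -= newly_exposed
        e += newly_exposed - newly_infectious
        i += newly_infectious - newly_removed
        new_infections.append(newly_exposed)
    # Assumption: deaths are a fixed fraction of infections, shifted by a fixed lag.
    return [ifr * new_infections[t - death_lag] if t >= death_lag else 0.0
            for t in range(days)]


def fit_r0(observed_deaths, population, candidates):
    """Grid search: pick the candidate R0 with the lowest squared error."""
    def squared_error(r0):
        simulated = simulate_seir(r0, population, len(observed_deaths))
        return sum((a - b) ** 2 for a, b in zip(simulated, observed_deaths))
    return min(candidates, key=squared_error)


# Usage: recover a known R0 from synthetic "observed" deaths.
synthetic = simulate_seir(2.5, population=1_000_000, days=60)
best_r0 = fit_r0(synthetic, 1_000_000, [1.5, 2.0, 2.5, 3.0])
print(best_r0)
```

A real fit searches over many parameters at once and validates against held-out data, but the loop of simulate, compare, and keep the best candidate is the same idea.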

Our SEIR model is open source. Our projections are uploaded daily onto GitHub. Everything is written in Python 3, and plotly is used for plotting.

The goal of this project is to showcase the strengths of artificial intelligence in tackling one of the world’s most difficult problems: predicting the course of a pandemic. Here, we use a purely data-driven approach by letting the machine do the learning. We are currently making projections for: the United States, all 50 US states (plus DC, PR, VI, GU, MP) and 70 countries (including all 27 EU countries). Combined, these 71 countries account for >95% of all global COVID-19 deaths. See the map below for a visualization of the countries we have projections for.

You can also directly access our US and global projections.

Coverage Map

World Map

How Our Model is Different

Historical Performance

Last Updated: September 14

A model isn’t very useful if it’s not accurate. Below is our analysis of how various models considered by the CDC have performed over the past few weeks. Because the CDC receives weekly projections every Monday, we use projections from past Mondays to evaluate the models.

We have open-sourced the code and data used to evaluate COVID-19 models: We believe in a fully transparent evaluation methodology, and publicly releasing all of our code and data is the best way to do so. Learn more about our evaluation methodology on GitHub.

Click here to see our past weekly performance evaluations and for more explanations behind the evaluations. We believe it’s important to look at past evaluations to get a more comprehensive idea of model consistency/accuracy.

Evaluation of historical 4 week ahead US state-by-state projections

This is a metric that shows the consistency of model projections over the period of several months.

Raw data on GitHub

4 week ahead states comparison

Evaluation of historical 4 week ahead US nationwide projections

Because US country-wide projections contain only a single forecast per week, there is much higher variance week-to-week compared to state-by-state projections, where there are 50+ forecasts each week. As a result, we believe state-by-state evaluations are a better indicator of model performance. This same concept is why we play 7-game series for NBA/NHL/MLB playoffs.

Raw data on GitHub

4 week ahead US comparison

Evaluation of past US state-by-state projections on reported deaths as of September 12

This is a metric that shows the recent accuracy of model projections.

Raw data on GitHub

States comparison

Evaluation of past US nationwide projections on reported deaths as of September 12

Raw data on GitHub

US comparison

Comparison of Late August US Projections

Below, we show our August 27 forecasts compared with IHME’s August 27 forecasts. To view additional comparison plots with IHME, click here.

Comparison of Current US Projections

We compare our current projections to those of the Institute for Health Metrics and Evaluation (IHME). To see our full US projections, click here.

Historical US Projections

Below, we show how our (C19Pro) August 4 and November 1 projections for the US have changed over time, compared to IHME. To see our full current US projections, click here.

To view additional comparison plots with IHME, click here.

CDC Projections Over Time

Below, we present our weekly CDC projections over time.

Data and Output

To make our projections, we use the daily death total provided by Johns Hopkins CSSE, which is considered by experts to be the “gold standard” reference data. We do not use case-related data in our modeling due to reasoning alluded to here. With that said, we do look at case and hospitalization data to help determine the bounds for our search grid, as changes in cases lead changes in deaths.

While we do not use testing data in our model, we sometimes use US testing data from The COVID Tracking Project in our research and graphs.

Every day, raw daily projections for all 50 US states and select international countries will be uploaded onto our GitHub page. We are projecting future deaths as reported by Johns Hopkins CSSE.


Epidemiological Assumptions

We use a consolidation of resources provided by Models of Infectious Disease Agent Study (MIDAS) to set standard parameters such as the incubation and infectious periods. Most of these parameters have a wide consensus among experts. For example, we assume a 5-day incubation period (on average) and a 7-day infectious period (on average). These assumptions are probabilistic and roughly normally distributed. This means that an infected individual would be infectious between Day 2 and Day 8 after exposure, with Days 4-6 being the most infectious. For the purpose of calculating current infections, we assume an average individual is infected for 15 days. The exact values of the above parameters do not significantly change our projections.

Not everyone who is “currently infected” is infectious. To get a sense of the number of individuals that are infectious, we recommend dividing the “currently infected” number by 2. To get the number of individuals who are at peak infectiousness, we recommend dividing the “currently infected” number by ~5.

Confidence Intervals

The future is not set in stone: a single policy change or a small change in the assumptions can have a large impact on how the epidemic progresses. That’s why in addition to our mean estimate, we also provide a 95% confidence interval to reflect this uncertainty. For example, if we predict 150,760 deaths with a range of 88-294k, it means that we are 95% confident that the true deaths will be between 88k and 294k. Note that these confidence intervals are generated given that our above assumptions hold true. There are many real-world variables that can cause our assumptions to be inaccurate and affect the true outcome. We will try our best to address any inaccurate assumptions as time goes on.

In addition to the 95% confidence interval, we present the mean estimate. This value is usually higher than the median/most likely estimate because it is accounting for a longer tail on the higher end of the estimates. So for example, if our mean estimate for September 2020 US deaths is 180k, our median/most likely estimate may be 170k. This is because the upper bound of the deaths is technically unbounded, while the lower bound is bounded by the current death total. This causes a skew in the distribution of death projections, leading to a mean estimate that is higher than the median estimate.
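To make the skew concrete, here is a small sketch that samples hypothetical death projections from a right-skewed distribution bounded below by the current death total. The log-normal shape and every number in it are assumptions chosen for illustration, not outputs of our model.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the sketch is repeatable
current_deaths = 150_000        # hypothetical lower bound: deaths already reported

# Assumption: additional deaths follow a right-skewed (log-normal) distribution.
projections = [current_deaths + rng.lognormvariate(10.0, 0.6)
               for _ in range(20_000)]

mean_estimate = statistics.mean(projections)
median_estimate = statistics.median(projections)
print(f"mean:   {mean_estimate:,.0f}")
print(f"median: {median_estimate:,.0f}")
```

Because the long tail sits entirely on the high side, the mean lands above the median, which is the same effect we see in our own projections.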

Our daily deaths confidence intervals are meant to be looked at from a rolling mean basis, rather than a daily incident basis. For example, if a state reports deaths every other day (e.g. 0, 200, 0, 100), a confidence interval that covers daily incident deaths can only use [0, 200], which is not very informative. A confidence interval such as [55, 95] would be more informative, despite not overlapping with any of the four daily incident deaths. Hence, we recommend using a 7-day rolling mean when evaluating our confidence intervals.
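The example above can be checked with a short sketch. The helper names are ours for this illustration, and the reporting pattern simply repeats the four values from the text.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean; early entries average whatever data exists so far."""
    means = []
    for end in range(1, len(values) + 1):
        chunk = values[max(0, end - window):end]
        means.append(sum(chunk) / len(chunk))
    return means


def fraction_inside(values, lower, upper):
    """Share of values that fall inside the closed interval [lower, upper]."""
    return sum(lower <= v <= upper for v in values) / len(values)


# A state that reports every other day, repeating the pattern from the text.
daily_reported = [0, 200, 0, 100] * 4
smoothed = rolling_mean(daily_reported, window=7)

# Raw daily counts never land inside [55, 95] ...
print(fraction_inside(daily_reported, 55, 95))
# ... but every full 7-day rolling mean does.
print(fraction_inside(smoothed[6:], 55, 95))
```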

We want to caution against focusing on one particular number as the outcome of this model. We are in fact projecting a range which includes a mean outcome. If the true results fall within the range, that is within the expected outcome of this model. When citing our projections, we highly recommend including our confidence intervals (e.g. 21,342 (15-34k) deaths).

Social Distancing

Heavy vs moderate social distancing

Heavy social distancing is what many states and countries enacted in the initial stages of the epidemic: stay-at-home orders, closed non-essential businesses, etc. Infection rates typically decrease ~60%, going from an R0 of around 2-3 to an R of 0.6-1.0. As long as R, a measure of how many people an infected person infects on average, is less than 1, infections will decrease over time. If R is greater than 1, then the infection curve will rise. Hence, the ultimate goal is to keep R under 1.

Moderate social distancing is what we assume will happen once states and countries gradually begin relaxing their social distancing guidelines. Some establishments will reopen, but people will still be somewhat cognizant about maintaining social distancing. Most states and countries will have guidelines that aim to maximize social distancing and minimize close contact, such as enforcing capacity limits and recommending mask-wearing. We assume that infection rates will increase by approximately 0-30%, resulting in an R of around 0.8-1.2. This is based on analysis of R values in regions where there were no lockdowns, such as Sweden and South Dakota. Note that this is still a lower infection rate than what it was prior to the outbreak for most regions.

If regions impose stricter social distancing guidelines than our assumptions listed above, then we will likely see lower infection and death rates than the current projections. Conversely, if regions impose looser guidelines, then we will likely see higher infection and death rates. For example, if California reopens before June 1, there will be an increased chance of an earlier resurgence. Or if a state required all residents to wear masks, the likelihood of a steep increase in infections would decrease, according to some recent studies ([1], [2], [3]).

Second wave

In regions where the outbreak has not yet been fully contained, it is possible that reopening will cause a second wave of infections if states fail to maintain sufficient social distancing. We assume that regions that have reopened will take actions to reduce transmission, such as increased contact tracing, mandatory mask wearing, improved treatments, capacity limits, etc. Over time, the aforementioned actions, as well as the natural progression of the virus, will lead to a reduction in the transmission rate.

In states where a second wave is prevalent, infections appear to reach a peak before undergoing a decline, despite a lack of concrete mitigation measures. One theory is that there is a certain subset of the population that is more susceptible to contracting the virus (old age, co-morbidities, unwillingness to take precautions, etc). Once that group is exhausted, it becomes harder for the virus to spread, leading to a decline in transmission despite no government intervention. Note that this is merely a theory to explain the observed data.

As of June 1, our model no longer assumes a second lockdown.


After the initial ramp-up period of a reopening (1-2 months), we assume that the spread will decrease over time due to improvements in contact tracing, increased mask wearing, greater awareness within the population, and increased population immunity. In the initial stages of the reopening, this phenomenon will likely be dwarfed by the act of the reopening itself, hence leading to a plateau or increase in cases. But after the rate of reopening for a region has plateaued 1-2 months later, we expect to see a gradual decline in transmission and hence a decline in infections and deaths. Of course, this assumption is highly subject to change based on the data.

Looking at the data, we noticed that as various states reach 10-35% prevalence, infections begin to slow down, despite no significant interventions. This seems to suggest that the effective herd immunity threshold (HIT) under the current conditions of social distancing and intervention measures may be lower than the 60-80% values previously reported in March/April. Nevertheless, it’s important to note that transmission does not stop once HIT is reached - it simply slows down. See our write-up, Estimating True Infections, for a more in-depth analysis on this topic.

Starting on July 22, we use two logistic (sigmoid) functions to approximate the R_t curve from the reopening. We use two parameters, the maximum reopen R_t and the inflection rate, to determine the shape. These two parameters are then learned by our machine learning layer based on the data. You can learn more by looking at our open-source code.
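One way to picture a curve built from two logistic functions is sketched below. This is an assumed functional form for illustration, not a copy of our open-source code: the function name, the two midpoints, and the lockdown and final R_t levels are all hypothetical, and only the maximum reopen R_t and the inflection rate correspond to the parameters named above.

```python
import math


def sigmoid(x):
    """Logistic function, written with tanh so large inputs cannot overflow."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))


def reopening_rt(day, lockdown_rt, max_reopen_rt, inflection_rate,
                 final_rt, rise_midpoint=20.0, decline_midpoint=80.0):
    """R_t on a given day after reopening, built from two logistic curves.

    The first sigmoid moves R_t from lockdown_rt up toward max_reopen_rt;
    the second pulls it from max_reopen_rt down toward final_rt.
    The midpoints and final_rt are hypothetical values for this sketch.
    """
    rise = sigmoid(inflection_rate * (day - rise_midpoint))
    decline = sigmoid(inflection_rate * (day - decline_midpoint))
    return (lockdown_rt
            + (max_reopen_rt - lockdown_rt) * rise
            - (max_reopen_rt - final_rt) * decline)


# Usage: print the curve every 20 days for one hypothetical region.
for day in range(0, 121, 20):
    print(day, round(reopening_rt(day, 0.8, 1.2, 0.2, 0.9), 3))
```

A larger inflection rate makes both transitions sharper, while the maximum reopen R_t sets the height of the plateau between them.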

Prior to July 22, we assume a very small daily decay in the transmission rate (R) starting from roughly 30 days after reopening (~0-0.5%). The decay is compounded daily until R drops below 1, at which point we stop applying further decays. As the exact value of the decay is unknown ahead of time, we initially sample this decay from a random distribution. As time goes on and we obtain more data regarding the post-reopening effects, our model will learn this decay.
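The compounding rule described above can be sketched in a few lines. The function name and the example values are hypothetical; the 30-day start and the stop-below-1 rule follow the text.

```python
def decayed_r(initial_r, daily_decay, days_since_reopening, decay_start_day=30):
    """Return R for each day after reopening under a compounding daily decay.

    The decay begins on decay_start_day and is no longer applied
    once R has dropped below 1.
    """
    r = initial_r
    r_by_day = []
    for day in range(days_since_reopening):
        if day >= decay_start_day and r >= 1.0:
            r *= 1.0 - daily_decay
        r_by_day.append(r)
    return r_by_day


# Usage: a 0.5% daily decay applied to a hypothetical post-reopening R of 1.1.
r_values = decayed_r(initial_r=1.1, daily_decay=0.005, days_since_reopening=120)
print(round(r_values[29], 4), round(r_values[-1], 4))
```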

Fall Wave

The future is uncertain, and many things can happen between now and fall that will change the trajectory of this epidemic. While we believe a September increase in deaths is unlikely, we do think it is possible that the rate of transmission may increase as we head towards winter. A few reasons for that include: seasonality of the virus, more time spent indoors, increased mobility as schools reopen and people return to work, and the potential loss of acquired immunity.

We currently assume a 0-0.5% daily increase in the transmission rate (R_t) starting at the end of summer (August). Initially, we randomly sample this value from a triangular distribution in our simulations. This results in a wider confidence interval to account for the increased uncertainty. As more data comes in over time, our machine learning algorithm will be able to better learn this value. It’s important to recognize that a fall wave is only a possibility and is not guaranteed.
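The sampling step can be sketched with Python's standard library. The mode of the triangular distribution, the 60-day horizon, and the starting R_t are assumptions for this example; only the 0-0.5% range comes from the text.

```python
import random
import statistics


def sample_fall_rt(current_rt, days, n_samples=10_000, low=0.0, high=0.005,
                   mode=0.0025, seed=0):
    """Project R_t forward under a randomly sampled daily increase.

    Each simulation draws one daily increase from a triangular distribution
    (mode is an assumed value) and compounds it over the given number of days.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_samples):
        daily_increase = rng.triangular(low, high, mode)
        outcomes.append(current_rt * (1.0 + daily_increase) ** days)
    return outcomes


# Usage: spread of R_t after 60 days, starting from a hypothetical R_t of 1.0.
outcomes = sorted(sample_fall_rt(current_rt=1.0, days=60))
lower = outcomes[int(0.025 * len(outcomes))]
upper = outcomes[int(0.975 * len(outcomes))]
print(f"mean R_t after 60 days: {statistics.mean(outcomes):.2f} "
      f"(95% range {lower:.2f}-{upper:.2f})")
```

Sampling the increase rather than fixing it is what widens the resulting interval.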

We are currently not changing the infection fatality rate (IFR) from the summer. But there have been studies (Kifer et al.) showing that the fatality rate may increase during the winter months due to factors such as lower indoor humidity.

Because our point estimates are mean estimates rather than median estimates, it is possible for our Rt mean estimates to remain below 1 while the new infections mean estimates increase (due to the skewness of the distribution).

Note: We currently do not explicitly model school reopenings. This is a situation we will continue to monitor. We believe it is possible that we will see an increase in infections due to school reopenings, but it is unclear to what extent this will translate to deaths.

Infections Estimate

The current and total infections estimates in our projections are at the core of our SEIR model. We use those estimates to make forecasts regarding future deaths according to the specifications of the SEIR model. The total infections estimate includes all individuals who have ever been infected by the virus, including asymptomatic individuals as well as those who were never tested. The current infections estimate is based on how many people are currently infected at that time point (total - recovered). To compute current infections, we assume that individuals are infected for an average of 15 days. We estimate that the true number of total infections is likely 5-15x higher than reported cases for most regions.
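The relationship between the two estimates (current = total - recovered) can be sketched as follows. For simplicity this treats the 15-day average as a fixed duration for every individual, which is an assumption of the example rather than how our simulator handles it.

```python
def infection_estimates(new_infections, infected_days=15):
    """Return (total, current) infection counts for each day.

    total   = everyone ever infected up to and including that day
    current = total minus those infected more than infected_days ago
    """
    total, current = [], []
    running_total = 0.0
    for day, new in enumerate(new_infections):
        running_total += new
        # Anyone infected before this cutoff is treated as recovered.
        recovered = sum(new_infections[:max(0, day - infected_days + 1)])
        total.append(running_total)
        current.append(running_total - recovered)
    return total, current


# Usage: a hypothetical region with a steady 100 new infections per day.
total, current = infection_estimates([100] * 30)
print(total[-1], current[-1])
```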

Effective Reproduction Number (R)

One of the most important properties for any infectious disease is the basic reproduction number, known as R0. Rather than pre-setting this value based on assumptions, our model is able to learn the value that most closely matches the data. For Italy, the R0 is found to be around 2.4-2.8, while for New York City, the R0 is 5.4-5.8. This means that on average, an infected person in New York City will infect 5.4 to 5.8 additional people. For most regions, the R0 is found to be around 2, which matches the WHO findings. We are able to generate a plot of how the R value changes over time for all of our projections. To see our estimates of R values for every state and country, see our Infections Tracker page.

Our R estimates are merely estimates rather than precise values, and are based only on deaths data. We correct for reporting lags, so how deaths are changing today is a reflection of how the R value was changing 3-4 weeks ago. We then apply additional assumptions explained in this section to interpolate the R value since then. As a result, the current R value estimates are more of a byproduct of our assumptions than a result of any measurable data. As we receive more data in the future, we then update our R estimates to most closely reflect the observed data. is a good resource for looking at R_t estimates using case data rather than deaths data.

Infection Fatality Rate (IFR)

Note that our IFR estimates are subject to change based on new data. The exact IFR value does not significantly affect our death estimates.

August 5 Update: See our write-up, Estimating True Infections, for our in-depth analysis on the infection fatality rate and its relationship with cases, deaths, and test positivity rates.

We estimate that the infection fatality rate (IFR) for COVID-19 in the US through April is between 0.9-1.2%. This matches a May 7 study that estimates the IFR to be slightly less than 1.3% after accounting for asymptomatic cases. We also found that most countries in Europe (with the exceptions of the United Kingdom, Spain, and Eastern Europe) have an IFR closer to 0.75%, which matches this May 6 study.

Prior to June, we use the following initial IFR in our projections:

Since June 1, 2020, we use a variable IFR that decreases over time to reflect a lower median age of infections, improving treatments, and seasonality of the virus. Hence, we decrease the initial IFR linearly over the span of 3 months until it is 30% of the original IFR. The initial IFR for each region is determined by looking at the case fatality ratio, and ranges from 0.5%-1.5%. Through the end of April, we estimate the IFR in the US to be around 0.7%, which is corroborated by CDC’s best estimate scenario, which cites a study from April. By August 2020, we estimate the IFR to be 0.2-0.4% in most of the US and Europe. For later-impacted regions like Latin America, we wait an additional 3 months before beginning to decrease the IFR.
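The linear schedule described above can be sketched as a small function. The function name is hypothetical, and treating "3 months" as 90 days is an assumption of this example.

```python
from datetime import date


def variable_ifr(initial_ifr, on_date, decline_start=date(2020, 6, 1),
                 decline_days=90, floor_fraction=0.3):
    """IFR on a given date under a linear decline to 30% of its initial value.

    Before decline_start the IFR is unchanged; after decline_days
    (an assumed 90-day stand-in for 3 months) it stays at the floor.
    """
    elapsed = (on_date - decline_start).days
    progress = min(max(elapsed / decline_days, 0.0), 1.0)
    return initial_ifr * (1.0 - (1.0 - floor_fraction) * progress)


# Usage: a region with a hypothetical 1% initial IFR.
for d in (date(2020, 5, 1), date(2020, 7, 16), date(2020, 12, 1)):
    print(d, f"{variable_ifr(0.01, d):.4%}")

# A later-impacted region starts its decline 3 months later.
print(f"{variable_ifr(0.01, date(2020, 7, 16), decline_start=date(2020, 9, 1)):.4%}")
```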

Recent global, Europe, and US studies point to a 0.5-1% IFR to be a reasonable estimate. One of the largest antibody studies thus far estimated a 1.2% IFR for Spain.

Our IFR is based on infectious individuals only. WHO announced in early June that asymptomatic individuals may not be infectious. However, CDC and other studies (#2) have shown that there is “no statistically significant difference in the viral load of symptomatic versus asymptomatic infections”. For our modeling purposes, we do not account for individuals that are not infectious, since they do not contribute to the spread of the virus. Note that asymptomatic individuals are different than pre-symptomatic individuals.

Lastly, we want to note that our IFR estimate is based on reported deaths, rather than true deaths. Since the US and many other regions around the world regularly underreport COVID-19 deaths, our IFR estimates are likely to be a lower bound for the true IFR.

Undetected Deaths

In our June 15 model update, we incorporated the concept of undetected deaths to better estimate the number of true infections in the early stages of the pandemic. In the first weeks of the pandemic for each region, we assume a significant percentage of deaths will be undetected/unreported due to a lack of testing. We assume that this percentage will decrease over time until it reaches a negligible amount. So if there are 100 true deaths and 20% are undetected, then only 80 deaths will be reported/projected. While it is possible that the undetected deaths ratio may be higher, the exact value does not significantly affect our projections.
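The arithmetic in the example above, along with one possible way the undetected share could shrink over time, is sketched below. The linear decline, the 60-day window, and both function names are assumptions for illustration; the text only states that the percentage decreases until it is negligible.

```python
def reported_deaths(true_deaths, undetected_fraction):
    """Deaths that show up in official counts given an undetected share."""
    return true_deaths * (1.0 - undetected_fraction)


def undetected_fraction_on(day, initial_fraction=0.2, decay_days=60):
    """Undetected share on a given day, declining linearly to zero.

    The linear shape and the 60-day window are hypothetical choices.
    """
    remaining = max(0.0, 1.0 - day / decay_days)
    return initial_fraction * remaining


# Usage: 100 true deaths at three points in a region's outbreak.
for day in (0, 30, 90):
    print(day, reported_deaths(100, undetected_fraction_on(day)))
```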

As a result of this update, the number of true infected individuals in our projections has increased. However, we believe that even after this update, we are still undercounting the true deaths in a region.

For further analysis of “excess deaths”, see official CDC data, reporting by The New York Times, Financial Times, or The Economist.


We want to be as clear as possible regarding what our model can and cannot do. While we try our best to make accurate projections, no model is perfect. Here we present some of the known limitations of our model.

Data Accuracy

A model is only as good as the data we feed it. If the data is not accurate, then it would be difficult to make accurate projections downstream. We only use official reported deaths in our modeling.

School Reopening

While we factor in a fall wave in our projections that may result in an increase in transmission, we do not explicitly model school reopenings. As of August, it is still unclear what the effect of school reopenings will be, and how it will differ from district to district and from state to state. We want to wait until we have more data before incorporating this phenomenon into our model.

Death Reporting and Excess Deaths

Some countries report probable deaths while others only report laboratory-confirmed deaths. This difference explains why countries with comprehensive reporting like Belgium have the highest death rates.

On June 8, the Washington Post published an investigation showing that “at least 24 [US] states are not heeding the national guidelines on reporting probable cases and deaths, despite previously identifying probable cases in other national outbreaks.” We made a series of Tweets about it here.

Differences in how countries/states report deaths can lead to unfair comparisons and also skew projections. For example, New York City reported close to 5,000 probable deaths between April 14-23, but has not reported any probable deaths since. This was an increase of 30% over the existing death total at the time. As a result, early April projections under-projected the number of deaths for New York, while our late April and early May projections over-projected the number of deaths for New York.

Because the accuracy of our projections relies on consistent reporting of deaths, any inconsistencies may skew our projections.

While we attempt to predict the official death total, the true death total will be higher due to underreporting at various levels. The New York Times, The Economist, and the Financial Times are currently tracking these excess deaths. Also see work by the Weinberger Lab at Yale School of Medicine.

Additional Limitations

While we attempt our best to ensure accuracy and precision, no model is perfect, so we urge everyone to use caution when interpreting these projections. This is just one particular model, so we encourage everyone to evaluate and be open to multiple sources. At the end of the day, the decision-making rests in the hands of people, not machines.

Twitter Threads

Last Updated: September 11

Due to the fast-paced environment of COVID-19 developments and our limited team (of one), Twitter is our preferred method of posting announcements, sharing results and facilitating discussion. Below is a comprehensive list of our Tweets since late April:


August 10 - Role of immunity vs behavior vs interventions (Louisiana case study)
August 5 - Estimating true infections
July 24 - Implied infection fatality rate


September 4 - Forecasting uncertainty
August 27 - Following the science
August 13 - Talk announcement
August 12 - Common pitfalls when estimating true infections
July 31 - True infections estimates
July 19 - CDC Forecast Hub
July 13 - Deaths lag infections
July 11 - Rise in infections/deaths
July 9 - Testing bottleneck
July 3 - Sample comparison with IHME
July 2 - Second wave
June 25 - Trivia
June 24 - Open source SEIR simulator
June 21 - IHME’s lack of representation
June 13 - Subway ridership analysis
June 12 - IHME September projections
June 9 - Probable deaths reporting
May 29 - Self-quarantine scenario
May 28 - CNN projection accuracy
May 20 - OpenTable analysis
May 16 - Seroprevalence in Stockholm
May 16 - New feature: Rt over time
May 15 - Shopping simulation
May 14 - Social distancing 1 week earlier scenario
May 13 - New feature: no reopenings
May 2 - NY serology survey
April 30 - Infections fatality rate
April 29 - Model added to CDC website
April 28 - NYC subway ridership analysis
April 26 - Rt estimates
April 25 - Death milestone probabilities

Projections updates

September 9
September 1
August 25
August 18
July 29
July 23
July 17
July 8
June 19
June 11
June 1
May 13
May 1
May 26
May 27
May 20

Model evaluations

July 14
June 29 - Baseline comparison with IHME
June 23
June 17 - Open source model evaluation
May 18
May 17
May 9
May 4
April 27
April 24


June 27
June 24
June 15
June 8
June 12
June 6
June 5

Testing targets

June 7
May 30
May 24
May 17

Concerns with the IHME model

In this section we will compare our projections with a popular model developed by the Institute for Health Metrics and Evaluation (IHME) and commonly referred to by the White House and media.

We present a series of Tweets highlighting the issues with the IHME model:

September 11
July 3
June 29
June 21
June 12
May 9
April 20
April 12

Below, you can find a comparison of our past projections (C19Pro) with IHME for the US, New York, New Jersey, and California, some of the most heavily impacted regions. You can find comparisons of late May projections in the Historical Performance section. Click here to view additional comparison plots.

As you can see from the graphs above, IHME’s projections have historically failed to accurately capture the true trajectory for these regions. Below, we will go into further details as to why IHME has been and still is a flawed model.

There are existing news articles such as Vox, STAT News, CNN, and Quartz that agree with our concerns.

In the words of Ruth Etzioni, an epidemiologist at Seattle’s Fred Hutchinson Cancer Research Center, “that [the IHME model] is being used for policy decisions and its results interpreted wrongly is a travesty unfolding before our eyes.”

Baseline Comparison: C19Pro vs IHME

If on June 1, you simply assume each state/country’s average daily deaths from the week before will be unchanged for the next 4 weeks, you can make a better forecast than IHME. This is equivalent to extending a straight line on the daily deaths plots.
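The baseline described above is simple enough to write down directly. The function names and the sample numbers are hypothetical; the rule itself (carry last week's average forward for 4 weeks) is the one stated in the text.

```python
def baseline_forecast(daily_deaths, horizon_days=28):
    """Forecast that repeats last week's average daily deaths for 4 weeks."""
    last_week = daily_deaths[-7:]
    average = sum(last_week) / len(last_week)
    return [average] * horizon_days


def total_absolute_error(forecast_daily, actual_daily):
    """Absolute difference between forecast and actual total deaths."""
    return abs(sum(forecast_daily) - sum(actual_daily))


# Usage with made-up numbers: eight days of history, then four weeks of actuals.
history = [10, 20, 30, 40, 50, 60, 70, 80]
forecast = baseline_forecast(history)
actual = [45] * 28
print(sum(forecast), total_absolute_error(forecast, actual))
```

Any model under evaluation can be scored with the same error function, which makes the baseline a useful sanity check.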

Baseline comparison US

Baseline comparison Global

The pattern is similar for other dates as well. See our open source evaluation for more.

Late May Projections

Below you can find some of our late May projections for 4 of the most heavily impacted states since reopening: Florida, California, Arizona, Texas.

Click here to view more plots of historical projections.

Comparison of Data Sources

Here is a comparison of the data sources we use in our model versus what IHME uses (from their June 11 press release). More is not always better.

C19Pro          IHME
Daily deaths    Daily deaths
                Case data
                Testing data
                Mobility data
                Pneumonia seasonality
                Mask use
                Population density
                Air pollution
                Low altitude
                Annual pneumonia death rate
                Smoking data
                Self-reported contacts

May 4 Revision

On May 4, IHME completely overhauled their previous model and increased their projections from 72k to 132k US deaths by August. Whereas they were previously underprojecting, they are now overprojecting the month of May. At the time of their new update on May 4, there were 68,919 deaths in the US. They projected that there will be 17,201 deaths in the week ending on May 11. In fact, there were only 11,757 deaths. IHME overshot their 1-week projections by 43%. Meanwhile, we projected 10,676 deaths from May 4 through May 11, an error of less than 10%.

IHME went from severely underprojecting to overprojecting, as you can see in the comparison of May 4 projections below. Furthermore, as recently as May 12, they were still projecting 0 deaths by August 4. Their model should not be relied on for accurate projections.

June 8 Revision

On June 8, IHME again revised their model to show a more realistic August projection. Their August 4 projection has now increased from 0 (0-0) deaths in their May 10 projections to 550 (264-1203) deaths in their revised June 8 projections. As the saying goes, better late than never.

IHME May 10 projections IHME June 8 projections

June 10 Revision

In their June 10 update, IHME is projecting deaths to decrease from June through August, and then increase from 400 deaths per day on September 1 to 1,000 deaths per day on October 1. Their press release is titled: “IHME models show second wave of COVID-19 beginning September 15 in US”. They cite back-to-school and “pneumonia seasonality” as reasons for this fall spike.

Unfortunately, there is no scientific data that supports this claim. In reality, pneumonia/influenza deaths are actually the lowest in August and September, according to the CDC. The same pattern holds true for bacterial pneumonia. Regarding back-to-school, schools in Europe have managed to successfully reopen with no rise in cases. Furthermore, children (age <18) account for less than 2% of all reported COVID-19 cases. Hence, it makes little sense for deaths to decrease when all of America goes back to work, but for deaths to increase when children go back to school.

Sample Summary of IHME Inaccurate Predictions

In their April 15 projections, the death total that IHME projected will take four months to reach was in fact exceeded in six days:

Region        April 21 Total Deaths    IHME Aug proj. deaths from Apr 15    Our Aug proj. deaths from Apr 15
New York      19,104                   14,542                               33,384
New Jersey    4,753                    4,407                                12,056
Michigan      2,575                    2,373                                8,196
Illinois      1,468                    1,248                                4,163
Italy         24,648                   21,130                               40,216
Spain         21,282                   18,713                               31,854
France        20,829                   17,448                               41,643

As you can see above, their models made misguided projections for almost all of the worst impacted regions in the world. The most alarming thing is that they continue to make low projections. Below are their projections from April 21. All of the below projections were exceeded by May 2, a mere 11 days later:

Region        May 2 Total Deaths    IHME Aug proj. deaths from Apr 21    Our Aug proj. deaths from Apr 21
New York      24,198                23,741                               35,238
New Jersey    7,742                 7,116                                13,651
Michigan      4,021                 3,361                                6,798
Illinois      2,559                 2,093                                6,653
Italy         28,710                26,600                               44,683
Spain         25,100                24,624                               31,854
France        24,763                23,104                               41,643

As scientists, we update our models as new data becomes available. Models are going to make wrong predictions, but it’s important that we correct them as soon as new data shows otherwise. The problem with IHME is that they refused to recognize and update their wrong assumptions for many weeks. Throughout April, millions of Americans were falsely led to believe that the epidemic would be over by June because of IHME’s projections.

On April 30, the director of the IHME, Dr. Chris Murray, appeared on CNN and continued to advocate their model’s 72,000 deaths projection by August. On that day, the US reported 63,000 deaths, with 13,000 deaths coming from the previous week alone. Four days later, IHME nearly doubled their projections to 135,000 deaths by August. One week after Dr. Murray’s CNN appearance, the US surpassed his 72,000 deaths by August estimate. It seems like an ill-advised decision to go on national television and proclaim 72,000 deaths by August only to double the projections a mere four days later.

Unfortunately, by the time IHME revised their projections in May, millions of Americans had already heard their 60,000-70,000 estimate. It may take a while to undo that misconception and undo the policies that were put in place as a result of this misleading estimate.


US June-August

As recently as May 3, IHME projected 304 (0-1644) total deaths in the US from June 1 to August 4, a span of two months. The US reported 768 deaths on June 1. So a single day’s death total exceeded IHME’s estimate for two months.

Update time

New data is extremely important when making projections such as these. That’s why we update our model daily based on the new data we receive. Projections using today’s data are much more valuable than projections from 2-3 days ago. However, due to certain constraints, IHME is only able to update their model 1-2 times a week: “Our ambition to produce daily updates has proven to be unrealistic given the relative size of our team and the effort required to fully process, review, and vet large amounts of data alongside implementing model updates.”

IHME days before update


Mobility Data

On April 17, IHME stated that they were incorporating new cell phone mobility data, which indicated that people had been properly practicing social distancing: “These data suggest that mobility and presumably social contact have declined in certain states earlier than the organization’s modeling predicted, especially in the South.” As a result, IHME lowered their projections from 68k deaths to 60k deaths by August. Their critical flaw is that they assume a linear relationship between lower mobility and lower infection, which is not the case.

Most transmissions do not happen with strangers, but rather with close contacts. Even if you reduce your mobility by 90%, you do not reduce your transmission by 90%. The data from Italy shows that transmission fell by only around 60%. That’s the difference between 20k and 40k+ deaths. IHME was likely making the wrong assumption that a 90% reduction in mobility will decrease transmission by 90%. Here is a compilation from infectious disease expert Dr. Muge Cevik showing that household contacts were the most likely to be infected.
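To make the point above concrete, here is a minimal toy SEIR simulation in Python. It is only a sketch and not the model used for the projections on this site: the function name and every parameter value (population, R0, incubation and infectious periods, fatality rate, intervention day) are illustrative assumptions, not fitted values, and deaths are crudely approximated as a fixed fraction of removed individuals. It simply compares the same epidemic under a 90% and a 60% reduction in transmission.

```python
def simulate_cumulative_deaths(transmission_reduction,
                               days=180,
                               population=10_000_000,
                               initial_infectious=100.0,
                               r0=2.5,
                               incubation_days=5.0,
                               infectious_days=5.0,
                               fatality_rate=0.01,
                               intervention_day=30):
    """Toy SEIR run with a one-day time step. All defaults are illustrative."""
    base_beta = r0 / infectious_days  # transmission rate before any intervention
    s = population - initial_infectious
    e, i, r = 0.0, initial_infectious, 0.0

    for day in range(days):
        beta = base_beta
        if day >= intervention_day:
            # Transmission is cut by the given fraction from the intervention on.
            beta = base_beta * (1.0 - transmission_reduction)

        new_exposed = beta * s * i / population   # S -> E
        new_infectious = e / incubation_days      # E -> I
        new_removed = i / infectious_days         # I -> R

        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed

    # Crude assumption: a fixed fraction of removed individuals have died.
    return fatality_rate * r


deaths_if_90_percent = simulate_cumulative_deaths(0.90)
deaths_if_60_percent = simulate_cumulative_deaths(0.60)

print(f"90% transmission reduction: {deaths_if_90_percent:,.0f} deaths")
print(f"60% transmission reduction: {deaths_if_60_percent:,.0f} deaths")
```

The absolute numbers from this sketch are not meaningful and are not intended to reproduce the 20k versus 40k+ figures. The only takeaway is that the 60% scenario ends with more cumulative deaths than the 90% scenario, so a model that equates a mobility drop with an equal transmission drop will under-project.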

We posted a Tweet on April 11 about MTA (NYC) and BART (Bay Area) subway ridership being down 90% in March. However, deaths have only dropped around 25% in NY, while CA has yet to see a sharp decrease in deaths in April, more than a month after the drop in ridership.

Interestingly, after IHME suddenly revised their projections from 72k to 130k on May 4, the director of IHME offered this explanation for why they raised their estimates: “…we’re seeing just explosive increases in mobility in a number of states that we expect will translate into more cases and deaths.” This directly contradicts their press release from just 2 weeks earlier, which stated that mobility had been lower than predicted. A 2-week difference in mobility should not explain this sudden jump in projections; only a flawed methodology would.


State Reopening Timeline

In their April 17 press release, IHME released estimates of when they believe each state will have a prevalence of fewer than 1 case per 1 million. They noted that 35 states will reach under 1 prevalent infection per 1 million before June 8, and that “states such as Louisiana, Michigan, and Washington, may fall below the 1 prevalent infection per 1,000,000 threshold around mid-May.”

As of May 15, Louisiana, Michigan, and Washington are reporting 30-90 confirmed cases per million each day. Furthermore, prevalent infections are 5-15x higher than reported cases since most cases are mild and thus not tested/reported. As a result, we estimate Louisiana and Michigan to have around 7,000 prevalent infections per million, which is 7,000 times higher than IHME’s April 17 estimates. An analysis of many of the remaining states shows a similarly high degree of error. Hence, IHME’s estimates have been off by more than 3 orders of magnitude.
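The arithmetic behind the "3 orders of magnitude" statement can be checked with a short Python sketch. The 1-per-million threshold, the 7,000-per-million estimate, the 30-90 daily cases per million, and the 5-15x undercount come from the text above. The helper `prevalence_per_million` and the 10-day duration of infection are assumptions introduced here purely for illustration; they are not necessarily how the 7,000 figure was derived.

```python
import math

# Figures quoted in the text above.
ihme_prevalence_per_million = 1.0          # IHME's mid-May threshold
estimated_prevalence_per_million = 7_000.0  # estimate for Louisiana and Michigan

ratio = estimated_prevalence_per_million / ihme_prevalence_per_million
orders_of_magnitude = math.log10(ratio)


def prevalence_per_million(daily_cases_per_million, undercount_factor,
                           assumed_days_infected):
    """Hypothetical back-of-envelope: people currently infected per million."""
    return daily_cases_per_million * undercount_factor * assumed_days_infected


# ASSUMPTION (not from the source): an infection counts as prevalent for ~10 days.
ASSUMED_DAYS_INFECTED = 10

low = prevalence_per_million(30, 5, ASSUMED_DAYS_INFECTED)
high = prevalence_per_million(90, 15, ASSUMED_DAYS_INFECTED)

print(f"ratio: {ratio:,.0f}x, i.e. {orders_of_magnitude:.2f} orders of magnitude")
print(f"back-of-envelope prevalence range: {low:,.0f} to {high:,.0f} per million")
```

Under that assumed duration, even the low end of the back-of-envelope range sits more than three orders of magnitude above 1 per million, so the conclusion does not hinge on the exact point estimate.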

Unfortunately, it is likely that many individuals and policy-makers used IHME’s misguided reopening timelines to shape decisions with regard to reopening. Their reopening timelines were picked up and widely disseminated by many media outlets, both local and national. Any policies guided by these estimates can have repercussions weeks and months down the road.


Technical Flaw

May 4 Update: IHME completely overhauled their previous model to now use an SEIR model. Our model is based on SEIR, and that has not changed since we first began making projections on April 1.

On top of everything we mentioned above, their model is also inherently flawed from a mathematical perspective. They try to model COVID-19 infections using a Gaussian error function. The problem is that the Gaussian error function is by design symmetric, meaning that the curve comes down from the peak at the same rate as it goes up. Unfortunately, this has not been the case for COVID-19: we come down from the peak at a much slower pace. This leads to a significant under-projection in IHME’s model, which we have thoroughly highlighted. University of Washington biology professor Dr. Carl T. Bergstrom discussed this in more detail in this highly informative series of Tweets.
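The symmetry described above can be seen directly in a few lines of Python. This is a minimal sketch, assuming cumulative deaths follow a scaled Gaussian error function; the parameter names (`total_deaths`, `steepness`, `peak_day`) and their values are hypothetical and are not taken from IHME's code or fitted to any data.

```python
import math


def cumulative_deaths(day, total_deaths, steepness, peak_day):
    """Cumulative deaths modeled as a scaled Gaussian error function."""
    return 0.5 * total_deaths * (1.0 + math.erf(steepness * (day - peak_day)))


def daily_deaths(day, total_deaths, steepness, peak_day):
    """Derivative of the curve above: a Gaussian bell centered on peak_day."""
    z = steepness * (day - peak_day)
    return total_deaths * steepness / math.sqrt(math.pi) * math.exp(-z * z)


# Hypothetical parameter values, chosen only for illustration.
TOTAL, STEEPNESS, PEAK = 60_000.0, 0.05, 40.0

two_weeks_before = daily_deaths(PEAK - 14, TOTAL, STEEPNESS, PEAK)
two_weeks_after = daily_deaths(PEAK + 14, TOTAL, STEEPNESS, PEAK)
deaths_by_peak = cumulative_deaths(PEAK, TOTAL, STEEPNESS, PEAK)

print(f"daily deaths 14 days before peak: {two_weeks_before:.1f}")
print(f"daily deaths 14 days after peak:  {two_weeks_after:.1f}")
print(f"share of total deaths by the peak: {deaths_by_peak / TOTAL:.0%}")
```

Whatever values are chosen, this functional form forces the decline to mirror the rise and places exactly half of all deaths before the peak. If the real decline is slower than the rise, a curve of this shape will understate the deaths that occur after the peak.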

Click here to see how our projections have changed over time, compared with the IHME model. For a comparison of April projections for several heavily-impacted states and countries, click here.

To conclude, we believe that a successful model must be able to quickly determine what is realistic and what is not, and the above examples highlight our main concerns with the IHME model.



Who We Are

This project is made by Youyang Gu, an independent data scientist. Youyang completed his Bachelor’s degree at the Massachusetts Institute of Technology (MIT), double majoring in Electrical Engineering & Computer Science and Mathematics. He also received his Master’s degree at MIT, completing his thesis as part of the Natural Language Processing group at the MIT Computer Science & Artificial Intelligence Laboratory. His expertise is in using machine learning to understand data and make accurate predictions. You can contact him on Twitter or by using the Contact page.
