By Gregory Simon, MD, MPH, senior investigator, Kaiser Permanente Washington Health Research Institute (KPWHRI); principal investigator, Mental Health Research Network; and Kaiser Permanente psychiatrist
Projections of the course of the COVID-19 pandemic have prompted vigorous arguments about which disease model performs best. As competing models deliver divergent predictions of morbidity and mortality, debate in the epidemiology and medical twitterverse has grown heated. Given that rowdy disagreement among experts, how can we evaluate the accuracy of competing models? Isn’t there some way of evaluating models that is more objective than a beauty contest, in which judges pick a winner based on looks and reputation? I think that question leads to a key distinction between types of mathematical models.
Most models predicting individual-level events (such as the models predicting suicidal behavior developed by the Mental Health Research Network, or MHRN) follow an empirical or inductive approach. An inductive model begins with “big data”, usually including a large number of events and a large number of potential predictors. The data then tell us which predictors are useful and how much weight each should be given. Theory or judgment may be involved in assembling the original data, but the data then make the key decisions. Regardless of our opinions about which predictors might matter, the data dictate which predictors actually matter.
In contrast, models predicting population-level change (including many competing models of COVID-19 morbidity and mortality) often follow a mechanistic or deductive approach. A deductive model assumes a mechanism of the underlying process, such as the susceptible-infected-recovered (S-I-R) model of infectious disease epidemics, which projects over a given time period the number of individuals who are susceptible to infection, are actively infected, or have recovered from infection. Deductive models begin with a presumed process, such as the relationship among the different compartments in the S-I-R model, that arises from theory and expert opinion.
In the specific case of COVID-19, epidemiologists attempt to estimate key rates or probabilities, such as the now-famous reproduction number, or R0, which, roughly translated for non-statisticians, represents the average number of people infected by one infectious individual. They apply those rates or probabilities to project how a pandemic will spread, but that can lead to problems: such rates are usually estimated from multiple sources and involve at least some expert opinion or interpretation. That is a critical difference from the empirical or inductive approach, which does not rely on assumptions about the underlying process or about key elements of the model.
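To make the deductive approach concrete, here is a minimal sketch of a discrete-time S-I-R simulation. The structure (three compartments, a transmission rate implied by an assumed R0) is the standard textbook form; the specific parameter values below (population size, R0, infectious period) are hypothetical illustrations, not COVID-19 estimates.

```python
# Minimal sketch of a discrete-time S-I-R compartmental model.
# All parameter values here are hypothetical, chosen for illustration only.

def sir_simulate(population, initial_infected, r0, recovery_days, days):
    """Project susceptible, infected, and recovered counts day by day."""
    gamma = 1.0 / recovery_days   # daily recovery probability
    beta = r0 * gamma             # transmission rate implied by the assumed R0
    s, i, r = population - initial_infected, float(initial_infected), 0.0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Hypothetical inputs: 1 million people, 100 infected, R0 = 2.5,
# 10-day infectious period, projected over 180 days.
trajectory = sir_simulate(1_000_000, 100, r0=2.5, recovery_days=10, days=180)
peak_infected = max(i for _, i, _ in trajectory)
```

Note that everything the simulation produces is downstream of the assumed mechanism and the assumed R0; no observed data enter the calculation at all, which is exactly the property that makes such models hard to validate.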
Judging the performance of empirical or inductive prediction models follows a standard path. At a minimum, we randomly divide our original data into one portion for developing a model and a separate portion for testing or validating it. Before using a prediction model to inform practice or policy, we would often test how well it travels – by testing or validating it in data from a different time or place. So far, our MHRN models predicting individual-level suicidal behavior have held up well in all of those tests. Inductive models are also used in the KPWHRI Learning Health Systems Program and have undergone a similar testing process.
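The develop-then-validate workflow described above can be sketched with synthetic data. Everything here is hypothetical: the records, the rule generating events, and the trivial threshold "model" all stand in for a real inductive pipeline, but the key discipline (fit on one random portion, judge on the held-out portion) is the same.

```python
# Minimal sketch of a develop/validate split on synthetic data.
# The data-generating rule and the threshold "model" are hypothetical.
import random

random.seed(0)

# Synthetic individual-level records: (risk_score, had_event)
scores = [random.random() for _ in range(10_000)]
records = [(s, s + random.gauss(0, 0.2) > 0.7) for s in scores]

# Randomly divide into a development portion and a validation portion.
random.shuffle(records)
split = int(0.7 * len(records))
development, validation = records[:split], records[split:]

# "Fit" a trivial threshold rule using the development sample only...
threshold = sorted(s for s, _ in development)[int(0.7 * len(development))]

def accuracy(sample, cutoff):
    """Fraction of records where (score > cutoff) matches the observed event."""
    return sum((s > cutoff) == event for s, event in sample) / len(sample)

# ...then judge it on the held-out validation sample.
print(f"validation accuracy: {accuracy(validation, threshold):.2f}")
```

Testing how well a model "travels" extends the same idea: instead of a random holdout, the validation sample comes from a different health system, time period, or place.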
That empirical validation process is usually neither feasible nor reasonable with mechanistic or deductive models, especially in the case of an emerging pandemic. If our observations are countries rather than people, we lack the sample size to divide the world into a model development sample and a validation sample. And it makes no sense to validate COVID-19 predictions based on how well they travel across time or place. We already know that key factors driving the pandemic vary widely over time and place. We could wait until the end of September to see which COVID-19 model made the best predictions for the summer, but that answer will arrive too late to be useful. Because we lack the data to judge model performance, the competition can be as subjective as a beauty contest.
I am usually skeptical of mechanistic or deductive models. Assumed mechanisms are often too simple. In April, some reassuring models used Farr’s Law to predict that COVID-19 would disappear as quickly as it erupted. Dr. William Farr, an English epidemiologist considered one of the founders of medical statistics, showed in the late nineteenth century that epidemics rise and fall in a roughly bell-shaped curve, and his observation has described some subsequent epidemics accurately. Unfortunately, COVID-19 didn’t follow that law.
Even when presumed mechanisms are correct, estimates of key rates or probabilities in mechanistic models often depend on expert opinion rather than data. Small differences in those estimates can lead to marked differences in final results. In predicting the future of the COVID-19 pandemic, small differences in expectations regarding the reproduction number or case fatality rate lead to dramatic differences in expected morbidity and mortality.
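That sensitivity is easy to demonstrate with a toy calculation. Below, two reproduction-number estimates that differ by only 0.5 are compounded over ten generations of unchecked spread; the numbers are purely illustrative, not COVID-19 estimates.

```python
# Toy demonstration of how small differences in an assumed reproduction
# number compound into large differences in projected caseload.
# Illustrative numbers only; these are not COVID-19 estimates.

def cumulative_infections(r0, generations, seed_cases=100):
    """Total infections after n generations of unchecked exponential spread."""
    return sum(seed_cases * r0 ** g for g in range(generations + 1))

# Two expert estimates that differ only modestly...
low = cumulative_infections(r0=2.0, generations=10)
high = cumulative_infections(r0=2.5, generations=10)

# ...diverge by several-fold in projected caseload.
print(f"R0=2.0 -> {low:,.0f} infections; R0=2.5 -> {high:,.0f}")
print(f"ratio: {high / low:.1f}x")
```

Real models are more elaborate than this geometric series, but the qualitative point carries over: modest disagreement about an input assumption can swamp everything else in the projection.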
When we have the necessary data, I’d rather remove mechanistic assumptions and expert opinion estimates from the equation. But we sometimes lack the data necessary to develop empirical or inductive models – especially when predicting the future of an evolving epidemic. So we will have to live with uncertainty – and often with vigorous arguments about the performance of competing models. Rather than trying to judge which COVID-19 model performs best, I’ll stick to what I know I need to do: Avoid crowds (especially indoors), wash my hands, and wear my mask!