Making suicide risk prediction models more complicated doesn't necessarily mean that they will do a better job. That's the biggest takeaway from a new study led by researchers at Kaiser Permanente Washington Health Research Institute (KPWHRI) and published in npj Digital Medicine. The study compared the performance of 5 different models developed to identify a patient's risk of suicidal behavior. An ensemble model (a combination of different model types) gave slightly better results than the others that were tested, but the improvement over a much simpler logistic regression model was minimal.
"What our results showed was that adding lots of extra information on the timing of predictors — for example, how long ago a depression diagnosis occurred — and using more complex machine learning approaches didn't substantially improve the performance of the algorithm," said Susan Shortreed, PhD, lead author of the study and a senior biostatistics investigator at KPWHRI. "This is good news for health care organizations that are currently using smaller models in clinical care or that are interested in implementing models like these."
Models designed to help predict suicide attempts draw on information from the electronic health record and patients' responses to self-report questionnaires, then use different types of algorithms to estimate suicide risk. When building models, researchers often think carefully about which data they need and which patterns the algorithm should be allowed to explore when making predictions.
For this study, the researchers compared 5 models (see Types of models tested, below, to learn more).
Predictors included information such as prescription fills, health care visits, and timing and frequency of mental health diagnoses. The researchers used data from more than 25 million health care visits across 7 different health care organizations to train models and evaluate how well they predicted suicide attempt risk 30 days and 90 days after a patient was seen.
The ensemble model performed the best of the 5 models tested, but the difference was small. A common measure of model performance is the "area under the curve," which essentially measures how well the model ranks individuals: how often individuals who experience the outcome receive a higher predicted probability than those who do not. Area under the curve ranges from 0.50 to 1, where a value of 0.50 means the model performs no better than flipping a coin, while a value of 1 represents perfect prediction. All of the models tested had an area under the curve above 0.79, and the improvement in the area under the curve of the ensemble model over the previously developed logistic regression model was only 0.006 to 0.020. The pattern in performance for all the models was consistent across subgroups broken out by race, ethnicity, and sex, and for both follow-up time periods.
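The ranking interpretation of area under the curve described above can be computed directly: compare every pair of one positive case and one negative case and count how often the positive case scores higher. A minimal sketch, using made-up predicted risks (the scores below are illustrative, not from the study):

```python
from itertools import product

def auc(scores_pos, scores_neg):
    """Probability that a randomly chosen positive case receives a
    higher predicted score than a randomly chosen negative case;
    ties count as half a win."""
    pairs = list(product(scores_pos, scores_neg))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

# Hypothetical predicted risks for visits followed by a suicide
# attempt (positives) and visits that were not (negatives).
positives = [0.90, 0.75, 0.60]
negatives = [0.80, 0.40, 0.30, 0.20]

print(round(auc(positives, negatives), 3))  # 10 of 12 pairs ranked correctly
```

A model that ranks every positive above every negative would score 1.0; one that ranks them at random would hover around 0.50.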
Smaller models, with fewer predictors, are easier to implement because they use less computing power and are less likely to slow down the electronic health record system. They can also be more straightforward to explain to clinicians, who want to understand how the models work when they are relying on the results to inform follow-up with their patients. The researchers wrote that transparency and trust were important factors in successfully using risk models in a health care setting.
"Reducing fatal and nonfatal self-harm is a crucial public health priority," Shortreed said. "Models do better at identifying risk than self-report surveys alone and can therefore be a very useful tool for delivery of suicide prevention programs, but it's important to understand the costs and benefits associated with their implementation and use. The more we can share with clinicians about what to expect and how to interpret the results from these models, the more likely it is that they can be leveraged effectively."
As researchers have previously pointed out, risk modeling is not designed to replace evaluation by a physician or mental health professional. Currently, suicide risk models used in clinical care act as a supplemental tool, alerting providers to conduct additional assessments with patients and directing evidence-based interventions. There is potential for error and harm if model predictions are used to inform coercive care measures, such as involuntary psychiatric holds. Health care organizations using information from models also need to consider that harms and benefits can vary across the population. For example, welfare or wellness checks conducted by police are not a supportive measure for everyone and can result in additional harms for groups who have been shown to have disproportionately negative consequences in interactions with the police.
Shortreed added that curating the selection of predictors while building a model is an important factor in how the models perform. Investing time and expertise in deciding which predictors to use, and having the ability to explain to clinicians why those predictors were chosen, can result in stronger models that are more trusted and more widely used.
Other KPWHRI coauthors on the study are Rod Walker, Eric Johnson, Robert Wellman, Maricela Cruz, Rebecca Ziebell, Yates Coley, Robert Penfold, and Gregory Simon.
Types of models tested

Logistic regression is a regression approach designed for predicting outcomes that take on only the value of 0 (for example, no self-harm observed) or 1 (for example, self-harm observed). Regression is the most common analytic way to describe relationships between predictors and outcomes. Regression models are straightforward: the regression equation is based on the mathematical equation for a line, which is easy to compute using basic operations such as addition and multiplication. But this simplicity can make it difficult to account for complex relationships between risk factors and outcomes.
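The "equation for a line" idea can be sketched in a few lines: a weighted sum of predictors is passed through the logistic function to produce a probability between 0 and 1. The predictors and coefficients below are hypothetical illustrations, not the study's actual model:

```python
import math

def predict_risk(predictors, intercept, coefficients):
    """Logistic regression: a linear weighted sum of predictors,
    mapped to a probability by the logistic (sigmoid) function."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two predictors: a prior self-harm
# diagnosis (0/1) and a 0-3 response to a depression-questionnaire item.
intercept = -5.0
coefs = [2.0, 0.8]

low_risk = predict_risk([0, 0], intercept, coefs)
high_risk = predict_risk([1, 3], intercept, coefs)
print(round(low_risk, 4), round(high_risk, 4))
```

Because the model is just a weighted sum, a clinician can read each coefficient directly as how much that predictor raises or lowers the estimated risk, which is part of the transparency advantage the researchers describe.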
Random forests average predictions across a set of analytic decision trees. An analytic decision tree divides the data into groups with similar values of predictors and outcomes (for example, most people in a group have self-harm observed following a visit). A random forest is made up of many trees to try to capture complex relationships between predictors (for example, depressive symptoms) and the outcome (for example, self-harm). More complex trees can be created by dividing the data on a larger number of variables, resulting in many smaller groups. Random forests can be used to learn more complex relationships than logistic regression, but at the cost of transparency (there isn’t a simple formula or visualization of a random forest) and speed of making predictions (often slower than logistic regression).
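The averaging idea can be illustrated with a few tiny hand-built decision trees; a real random forest grows hundreds of trees on random subsets of the data and predictors. The predictor names and splits below are hypothetical:

```python
# Each "tree" divides patients into groups by asking questions about
# predictors, then returns the observed risk for the matching group.

def tree_a(p):
    # Split first on a depression-questionnaire item, then prior self-harm.
    if p["symptom_score"] >= 2:
        return 0.6 if p["prior_self_harm"] else 0.3
    return 0.05

def tree_b(p):
    if p["prior_self_harm"]:
        return 0.5
    return 0.2 if p["symptom_score"] >= 1 else 0.02

def tree_c(p):
    if p["recent_mh_visit"] and p["symptom_score"] >= 2:
        return 0.4
    return 0.1

def forest_predict(p):
    """Average the trees' predictions to get the forest's risk estimate."""
    trees = [tree_a, tree_b, tree_c]
    return sum(t(p) for t in trees) / len(trees)

patient = {"symptom_score": 3, "prior_self_harm": True, "recent_mh_visit": True}
print(round(forest_predict(patient), 3))  # (0.6 + 0.5 + 0.4) / 3 = 0.5
```

Even in this toy version, the trade-off the article mentions is visible: the prediction is an average over many branching rules, so there is no single formula to show a clinician.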
An artificial neural network is a sequence of regression models with many predictors. Rather than directly predicting the outcome, the algorithm creates combinations of the predictors (for example, multiplying many predictors together). The predictor combinations are then used in a different regression model to predict the outcome. Artificial neural networks can be used to learn even more complex relationships than random forests but can be difficult to estimate accurately because of these many “layers” of estimation (and they often take a lot of time to estimate). Lots of information is needed to learn the optimal combination of predictors in addition to the relationship between the combined predictors and the outcome.
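The "sequence of regression models" description corresponds to a network with one or more hidden layers: each hidden unit is itself a small regression over the predictors, and the output layer is a regression over those learned combinations. A minimal forward-pass sketch with hand-set, hypothetical weights (a trained network would estimate these from data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neural_net_predict(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer: each hidden unit combines the predictors with
    its own weights, and the output layer combines the hidden units."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(out_bias + sum(w * h for w, h in zip(out_weights, hidden)))

# Hypothetical weights for two predictors and two hidden units.
hw = [[1.5, -0.5], [0.7, 2.0]]
hb = [-1.0, 0.5]
ow = [2.0, 1.5]
ob = -3.0

risk = neural_net_predict([1.0, 0.0], hw, hb, ow, ob)
print(round(risk, 3))
```

The extra layers are what let the network capture interactions a single line cannot, but they are also why it needs far more data to estimate and is harder to explain, as the article notes.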
By Amelia Apfel