David Carrell, PhD

“My work uses computers to mine and analyze information about patients’ health from the millions of clinical notes Kaiser Permanente Washington doctors and nurses write about their patients in a typical year.”

David Carrell, PhD 

Assistant Investigator, Kaiser Permanente Washington Health Research Institute 

Areas of focus:


David Carrell, PhD, is an assistant investigator whose most significant contributions to science entail the development and application of technologies for extracting rich information from unstructured clinical text, such as physician progress notes. This work uses state-of-the-art clinical natural language processing (NLP) technologies in single- and multi-site settings.

Precision phenotyping, the process of using information from electronic health records (EHRs) to identify patients with (or without) specific clinical characteristics, is a major focus of Dr. Carrell’s work. One example is the development of an NLP system to identify women who have been diagnosed with recurrent breast cancer. Although recurrent breast cancer is a common and consequential clinical diagnosis, it cannot be reliably identified from the standardized diagnosis codes found in a person’s chart. Such codes often appear in a chart to justify, for example, an imaging procedure ordered when recurrence is suspected or as part of a standard monitoring plan. In such cases, a code for recurrence will appear in the chart, but the patient’s chart notes will also state that the imaging study did not find evidence of disease.

For many years, researchers had desired a robust, automated method to identify women with breast cancer recurrences, yet it remained an unsolved problem because of the difficulty of isolating true positive cases (women with diagnosed recurrence) from false positive cases (women for whom there was suspicion of disease but none was found).
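The distinction between a recurrence code and a confirmed recurrence can be illustrated with a toy sketch. This is not the NLP system described on this page; it is a simplified, hypothetical rule-based check in which the cue lists, function names, and labels are all assumptions made for illustration. It treats a mention of recurrence as unconfirmed when a negation or uncertainty cue precedes it in the same sentence.

```python
import re

# Hypothetical cue lists, for illustration only; a real system would need far more.
RECURRENCE = re.compile(r"\b(recurrence|recurrent|recurred)\b", re.IGNORECASE)
NEGATED_OR_UNCERTAIN = re.compile(
    r"\b(no evidence of|negative for|no sign of|without|"
    r"possible|suspicion of|suspected|evaluate for)\b",
    re.IGNORECASE,
)


def sentence_asserts_recurrence(sentence):
    """Return True if the sentence mentions recurrence with no preceding cue."""
    match = RECURRENCE.search(sentence)
    if match is None:
        return False
    # Only text before the mention is checked, a deliberate simplification.
    return NEGATED_OR_UNCERTAIN.search(sentence[: match.start()]) is None


def classify_patient(has_recurrence_code, notes):
    """Combine a structured code flag with evidence from free-text notes."""
    sentences = [s for note in notes for s in re.split(r"(?<=[.!?])\s+", note)]
    if any(sentence_asserts_recurrence(s) for s in sentences):
        return "likely recurrence"
    if has_recurrence_code:
        return "code only (suspicion, not confirmed)"
    return "no recurrence signal"
```

In this sketch, a chart with a recurrence code and a note reading "There is no evidence of recurrence." falls into the code-only group, which corresponds to the false positive cases described above.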

Our solution, supported by a grant from the National Cancer Institute (“Natural Language Processing for Cancer Research Network Surveillance Studies,” RC1CA146917, Carrell, PI), incorporated information from clinician progress notes, radiology reports, and pathology reports. The NLP algorithm achieved very high sensitivity (equal to that of expert manual chart abstractors). Published in the American Journal of Epidemiology, this was the first major report in that journal illustrating the power of clinical NLP methods.
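Sensitivity here means the share of true recurrence cases that the algorithm identifies, that is, true positives divided by the sum of true positives and false negatives. The short sketch below shows that calculation against a manually abstracted reference standard. The function name and the example labels are made up for illustration and are not results from the study.

```python
def sensitivity(gold, predicted):
    """Fraction of gold-standard positive cases that were also predicted positive."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if true_pos + false_neg == 0:
        raise ValueError("gold standard contains no positive cases")
    return true_pos / (true_pos + false_neg)


# Made-up labels: True means recurrence according to manual chart abstraction.
gold_labels = [True, True, True, False, False]
nlp_labels = [True, True, False, True, False]
print(sensitivity(gold_labels, nlp_labels))
```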

Working with teams of researchers inside and outside KPWHRI, Dr. Carrell has applied similar precision phenotyping methods to identify patients with stenosis of the carotid artery, colon polyps, and problem use of prescription opioids. He is also using these methods to help health care systems evaluate the performance of physicians conducting screening colonoscopy examinations.

Dr. Carrell has also developed and applied computing methods for automated de-identification of clinical text. Removing identifiers is important for preserving patient privacy when clinical text is shared with scientific collaborators outside Kaiser Permanente Washington. Automated de-identification methods make such sharing efficient and scalable.
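As a rough illustration of what automated de-identification does, the sketch below replaces a few kinds of identifiers with placeholder tags. It is a deliberately simple rule-based stand-in and not the approach used in this research; the patterns, tag names, and function name are assumptions, and fixed patterns like these would miss many real identifiers.

```python
import re

# Hypothetical patterns for illustration only; real clinical text is far more varied.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("NAME", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")),
]


def deidentify(text):
    """Replace each matched identifier with a bracketed placeholder tag."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(deidentify("Mrs. Smith seen on 3/14/2018, call 206-555-0100."))
# prints "[NAME] seen on [DATE], call [PHONE]."
```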

In collaboration with scientists at MITRE Corporation and Vanderbilt University, Dr. Carrell has used a method called “machine learning” to train software systems to recognize patient identifiers in clinical text, even when the system encounters names, dates, and other identifiers it has never before seen.

Recently, this work was supported by the National Library of Medicine through a grant called “Scalable and Robust Clinical Text De-Identification Tools.” Also relevant to this work are studies Dr. Carrell has led investigating the cost and accuracy of creating manually annotated sets of documents used to train machine-learned systems, and the vulnerability to reverse-engineering attacks of clinical text de-identified with automated methods.

Conducting research in diverse health care settings and patient populations is important to making our scientific discoveries robust and widely applicable. Throughout his career, Dr. Carrell has participated in a number of large-scale multi-site research projects. Recent examples include projects in the Health Care Systems Research Network (HCSRN) and the Electronic Medical Records and Genomics (eMERGE) network. The latter is a twelve-site study exploring genetic predictors of disease in very large patient populations based on precision phenotypes derived from electronic medical records data. Dr. Carrell has also participated in studies sponsored by the Patient-Centered Outcomes Research Institute (PCORI) and the National Institutes of Health (NIH) Collaboratory.


  • Clinical Natural Language Processing

    Recurrent breast cancer; Colonoscopy quality; Extracting information from clinical text; Methods for using NLP in multi-site research

  • Clinical Text De-identification

    Automated methods for removing patient identifiers from clinical text; Vulnerability of automated de-identification methods to malicious attack

  • Cancer and Cancer Screening

    Identifying recurrent breast cancer using EHR text; Colonoscopy quality metrics

  • Pharmacoepidemiology

    Methods for identifying patients’ addiction to prescription opioids; Cost and utilization of health care services among patients with problem opioid use

  • Medication Use & Patient Safety

    Surveillance methods for problem use of prescription opioids; Health care costs and utilization associated with problem opioid use

  • Health Informatics

    Methods for using structured and unstructured electronic health record data to identify patients with (or without) specific clinical conditions or phenotypes for large scale epidemiological and genomic studies

Recent publications

Masters ET, Ramaprasan A, Mardekian J, Palmer RE, Gross DE, Cronkite D, Von Korff M, Carrell DS. Natural language processing-identified problem opioid use and its associated health care costs. J Pain Palliat Care Pharmacother. 2019 Jan 31:1-10. doi: 10.1080/15360288.2018.1488794. [Epub ahead of print].

Mercaldo ND, Brothers KB, Carrell DS, Clayton EW, Connolly JJ, Holm IA, Horowitz CR, Jarvik GP, Kitchner TE, Li R, McCarty CA, McCormick JB, McManus VD, Myers MF, Pankratz JJ, Shrubsole MJ, Smith ME, Stallings SC, Williams JL, Schildcrout JS. Enrichment sampling for a multi-site patient survey using electronic health records and census data. J Am Med Inform Assoc. 2018 Dec 27. pii: 5263776. doi: 10.1093/jamia/ocy164. [Epub ahead of print].

Mosley JD, Benson MD, Smith JG, Melander O, Ngo D, Shaffer CM, Ferguson JF, Herzig MS, McCarty CA, Chute CG, Jarvik GP, Gordon AS, Palmer MR, Crosslin DR, Larson EB, Carrell DS, Kullo IJ, Pacheco JA, Peissig PL, Brilliant MH, Kitchner TE, Linneman JG, Namjou B, Williams MS, Ritchie MD, Borthwick KM, Kiryluk K, Mentch FD, Sleiman PM, Karlson EW, Verma SS, Zhu Y, Vasan RS, Yang Q, Denny JC, Roden DM, Gerszten RE, Wang TJ. Probing the virtual proteome to identify novel disease biomarkers. Circulation. 2018;138(22):2469-2481. doi: 10.1161/CIRCULATIONAHA.118.036063.

Hall TO, Stanaway IB, Carrell DS, Carroll RJ, Denny JC, Hakonarson H, Larson EB, Mentch FD, Peissig PL, Pendergrass SA, Rosenthal EA, Jarvik GP, Crosslin DR. Unfolding of hidden white blood cell count phenotypes for gene discovery using latent class mixed modeling. Genes Immun. 2018 Nov 21. doi: 10.1038/s41435-018-0051-y. [Epub ahead of print].


healthy findings blog

Health records can reveal early signals of slow changes like Alzheimer’s

The words we use and our doctors’ notes hold hints about our health. Data Scientist David Carrell, PhD, tells how we’re learning to catch those clues. 

Natural Language Processing

‘Teaching’ computers to read doctors’ notes promises gains in research efficiency