David Carrell, PhD

“My work uses computers to mine and analyze information about patients’ health from the millions of clinical notes Kaiser Permanente Washington doctors and nurses write about their patients in a typical year.”

David Carrell, PhD 

Assistant Investigator, Kaiser Permanente Washington Health Research Institute 

Areas of focus:


David Carrell, PhD, is an assistant investigator whose most significant contributions to science entail the development and application of technologies for extracting rich information from unstructured clinical text, such as physician progress notes. This work uses state-of-the-art clinical natural language processing (NLP) technologies in single- and multi-site settings.

Precision phenotyping, the process of using information from electronic health records (EHRs) to identify patients with (or without) specific clinical characteristics, is a major focus of Dr. Carrell’s work. One example is the development of an NLP system to identify women who have been diagnosed with recurrent breast cancer. Although it is a common and consequential diagnosis, recurrent breast cancer cannot be reliably identified from the standardized diagnosis codes found in a person’s chart. Such codes often appear in a chart to justify, for example, an imaging procedure ordered when recurrence is suspected or as part of a standard monitoring plan. In such cases, a code for recurrence appears in the chart even when the patient’s chart notes state that the imaging study found no evidence of disease.

For many years, researchers sought a robust, automated method to identify women with breast cancer recurrences, yet the problem remained unsolved because of the difficulty of separating true positive cases (women with diagnosed recurrence) from false positive cases (women for whom there was suspicion of disease but none was found).

Our solution, supported by a grant from the National Cancer Institute (“Natural Language Processing for Cancer Research Network Surveillance Studies,” RC1CA146917, Carrell, PI), incorporated information from clinician progress notes, radiology reports, and pathology reports. The NLP algorithm achieved very high sensitivity (equal to that of expert manual chart abstractors). Published in the American Journal of Epidemiology, this was the first major report in that journal illustrating the power of clinical NLP methods.

Working with teams of researchers inside and outside KPWHRI, Dr. Carrell has applied similar precision phenotyping methods to identify patients with stenosis of the carotid artery, colon polyps, and problem use of prescription opioids. He is also using these methods to help health care systems evaluate the performance of physicians conducting screening colonoscopy examinations.

Dr. Carrell has also developed and applied computing methods for automated de-identification of clinical text. Removing identifiers is important for preserving patient privacy when clinical text is shared with scientific collaborators outside Kaiser Permanente Washington. Automated de-identification methods make such sharing efficient and scalable.

In collaboration with scientists at MITRE Corporation and Vanderbilt University, Dr. Carrell has used a method called “machine learning” to train software systems to recognize patient identifiers in clinical text, even when the system encounters names, dates, and other identifiers it has never before seen.

Recently, this work was supported by the National Library of Medicine through a grant called “Scalable and Robust Clinical Text De-Identification Tools.” Also relevant to this work are studies Dr. Carrell has led investigating the cost and accuracy of creating manually annotated sets of documents used to train machine-learned systems, and the vulnerability to reverse-engineering attacks of clinical text de-identified with automated methods.

Conducting research in diverse health care settings and patient populations is important to making our scientific discoveries robust and widely applicable. Throughout his career, Dr. Carrell has participated in a number of large-scale multi-site research projects. Recent examples include projects in the Health Care Systems Research Network (HCSRN) and the Electronic Medical Records and Genomics (eMERGE) network. The latter is a twelve-site study exploring genetic predictors of disease in very large patient populations based on precision phenotypes derived from electronic medical records data. Dr. Carrell has also participated in studies sponsored by the Patient-Centered Outcomes Research Institute (PCORI) and the National Institutes of Health (NIH) Collaboratory.


  • Clinical Natural Language Processing

    Recurrent breast cancer; Colonoscopy quality; Extracting information from clinical text; Methods for using NLP in multi-site research

  • Clinical Text De-identification

    Automated methods for removing patient identifiers from clinical text; Vulnerability of automated de-identification methods to malicious attack

  • Cancer and Cancer Screening

    Identifying recurrent breast cancer using EHR text; Colonoscopy quality metrics

  • Pharmacoepidemiology

    Methods for identifying patients’ addiction to prescription opioids; Cost and utilization of health care services among patients with problem opioid use

  • Medication Use & Patient Safety

    Surveillance methods for problem use of prescription opioids; Health care costs and utilization associated with problem opioid use

  • Health Informatics

    Methods for using structured and unstructured electronic health record data to identify patients with (or without) specific clinical conditions or phenotypes for large scale epidemiological and genomic studies

Recent publications

Hall TO, Stanaway IB, Carrell DS, Carroll RJ, Denny JC, Hakonarson H, Larson EB, Mentch FD, Peissig PL, Pendergrass SA, Rosenthal EA, Jarvik GP, Crosslin DR. Unfolding of hidden white blood cell count phenotypes for gene discovery using latent class mixed modeling. Genes Immun. 2018 Nov 21. doi: 10.1038/s41435-018-0051-y. [Epub ahead of print]. PubMed

Ezaz G, Leffler DA, Beach S, Schoen RE, Crockett SD, Gourevitch RA, Rose S, Morris M, Carrell DS, Greer JB, Mehrotra A. Association between endoscopist personality and rate of adenoma detection. Clin Gastroenterol Hepatol. 2018 Oct 13. pii: S1542-3565(18)31140-6. doi: 10.1016/j.cgh.2018.10.019. [Epub ahead of print]. PubMed

Stanaway IB, Hall TO, Rosenthal EA, Palmer M, Naranbhai V, Knevel R, Namjou-Khales B, Carroll RJ, Kiryluk K, Gordon AS, Linder J, Howell KM, Mapes BM, Lin FTJ, Joo YY, Hayes MG, Gharavi AG, Pendergrass SA, Ritchie MD, de Andrade M, Croteau-Chonka DC, Raychaudhuri S, Weiss ST, Lebo M, Amr SS, Carrell D, Larson EB, Chute CG, Rasmussen-Torvik LJ, Roy-Puckelwartz MJ, Sleiman P, Hakonarson H, Li R, Karlson EW, Peterson JF, Kullo IJ, Chisholm R, Denny JC, Jarvik GP; eMERGE Network, Crosslin DR. The eMERGE genotype set of 83,717 subjects imputed to ~40 million variants genome wide and association with the herpes zoster medical record phenotype. Genet Epidemiol. 2018 Oct 8. doi: 10.1002/gepi.22167. [Epub ahead of print]. PubMed

Mosley JD, Feng Q, Wells QS, Van Driest SL, Shaffer CM, Edwards TL, Bastarache L, Wei WQ, Davis LK, McCarty CA, Thompson W, Chute CG, Jarvik GP, Gordon AS, Palmer MR, Crosslin DR, Larson EB, Carrell DS, Kullo IJ, Pacheco JA, Peissig PL, Brilliant MH, Linneman JG, Namjou B, Williams MS, Ritchie MD, Borthwick KM, Verma SS, Karnes JH, Weiss ST, Wang TJ, Stein CM, Denny JC, Roden DM. A study paradigm integrating prospective epidemiologic cohorts and electronic health records to identify disease biomarkers. Nat Commun. 2018 Aug 30;9(1):3522. doi: 10.1038/s41467-018-05624-4. PubMed


Latest News

Kaiser Permanente researchers explore patients’ marijuana use

Routinely asking about cannabis use can better serve patients by helping clinicians start conversations about risks and benefits.

Read it in News and Events.

Healthy Findings Blog

Health records can reveal early signals of slow changes like Alzheimer’s

The words we use and our doctors’ notes hold hints about our health. Data Scientist David Carrell, PhD, tells how we’re learning to catch those clues. 

Read it in Healthy Findings.

Natural Language Processing

‘Teaching’ computers to read doctors’ notes promises gains in research efficiency