Deidentification of free-text clinical notes with pretrained bidirectional transformers

Abstract

The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogenous (for example, there are uncountable variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. Consequently, software for adequately deidentifying clinical data is not widely available. As a result patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice.In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human interpretable evaluation measures and demonstrate state of the art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source and simple to install, allowing for broad reuse.

Date
Jul 24, 2020 1:00 PM — 1:08 PM
Location
Toronto, Canada
Avatar
Scientist

Alistair is a scientist focusing on data analysis in healthcare. This includes retrospective observational studies to generate new knowledge, predictive modeling using machine learning to prognosticate outcome, and curation of data to support the research enterprise.