


  • Natural language processing (NLP) is a collection of tools and algorithmic techniques that aim to help humans “structure” free-text information and gain an in-depth understanding of it.

  • Overview of different types of NLP tools:

    • Vocabulary- and rule-based NLP is the oldest but most easily interpretable type of NLP. Complex clinical NLP pipelines take a lot of resources and years to build and are often difficult to adapt to different clinical domains. However, simple look-up–type techniques can be useful in many clinical auditing cases where precision is more important than recall.

    • Supervised NLP is powerful as long as there is a large enough human-labeled data set to train the machine learning model. However, large, well-labeled, task-specific data sets take substantial clinical resources to curate.

    • With unsupervised NLP, there is no need for labeling because these machine learning models can automatically discover patterns in the data and propose groups or classes. However, a human needs to interpret the resulting groups to figure out the “why” and “what.”

    • Expert-in-the-loop NLP: What if we make the experts more efficient at helping the machine learn a task? The challenge is to present questions or uncertainties from the machine models to the human in a user-friendly and interactive manner.

  • In health care, there is no one-size-fits-all NLP solution. There are many tasks in the clinical domain amenable to different types or combinations of NLP methods. Understanding the performance requirements of the clinical task and the limitations of different NLP tools can help with implementing the most appropriate NLP solution.
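As an illustration of the simple look-up–type technique mentioned above, the following is a minimal Python sketch, not a clinical-grade pipeline. The vocabulary, the concept labels, the function name `find_terms`, and the example note are all hypothetical and chosen only for illustration. It shows why such methods tend to favor precision over recall: exact whole-word matches are rarely wrong, but variants that are not in the vocabulary (here, "fell") are missed.

```python
import re

# Hypothetical mini-vocabulary: surface term -> concept label (illustrative only).
VOCABULARY = {
    "pressure ulcer": "PRESSURE_INJURY",
    "pressure injury": "PRESSURE_INJURY",
    "fall": "FALL",
}


def find_terms(note, vocabulary=VOCABULARY):
    """Return (start offset, term, concept) for each whole-word vocabulary match."""
    hits = []
    for term, concept in vocabulary.items():
        # \b anchors keep "fall" from matching inside words such as "rainfall".
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, note, flags=re.IGNORECASE):
            hits.append((match.start(), term, concept))
    # Sort by position so hits come back in reading order.
    return sorted(hits)


# Hypothetical free-text note.
note = (
    "Patient had a fall overnight. Stage 2 pressure ulcer noted on sacrum. "
    "Pt fell again at 0600."
)

for start, term, concept in find_terms(note):
    print(start, term, concept)
```

In this sketch the second event ("fell") is not detected because only the exact term "fall" is in the vocabulary. Closing such gaps means adding variants or rules by hand, which is one reason larger rule-based pipelines take so long to build.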


In health care, natural language is still the most common tool for conducting patient-provider and provider-provider interactions and for recording them in electronic health records (EHRs). Computers, however, prefer structured data. Lists of diagnosis codes, medication codes, and procedure codes are easy to search, tabulate, and aggregate, and thus are the go-to sources for data science and analytics performed on clinical records. Such clinical coding, however, is historically based on administrative and billing requirements, where typically only the top 1 or 2 issues per patient visit are structured or “coded” in the form of International Classification of Diseases (ICD), Current Procedural Terminology (CPT), diagnosis-related group (DRG), and other billing codes.1 Moreover, many clinically important observations are not absolute or discrete values that lend themselves to being structured (coded) as check boxes on a form. As such, the bulk of recorded health care information about patients is currently stored as unstructured data in the EHR.2,3 Any free-text documentation, such as clinical notes or investigation reports, is unstructured data, which cannot easily be analyzed automatically to provide insights for improving clinical care.
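The contrast between coded and free-text data can be made concrete with a small Python sketch. The visit records, the field names `codes` and `note`, and the helper functions below are hypothetical and are not drawn from any real EHR schema. The ICD-10 codes shown (E11.9 and I10) are used only as familiar examples.

```python
from collections import Counter

# Hypothetical visit records: "codes" holds the structured billing codes,
# "note" holds the free-text documentation for the same visit.
visits = [
    {
        "codes": ["E11.9", "I10"],
        "note": "Lives alone. Daughter reports new confusion at night.",
    },
    {
        "codes": ["I10"],
        "note": "BP controlled. Complains of low mood since retirement.",
    },
]


def tabulate_codes(visits):
    """Count how often each structured code appears across all visits."""
    return Counter(code for visit in visits for code in visit["codes"])


def notes_mentioning(visits, keyword):
    """Return indices of visits whose note contains the keyword (case-insensitive)."""
    return [
        index
        for index, visit in enumerate(visits)
        if keyword.lower() in visit["note"].lower()
    ]


code_counts = tabulate_codes(visits)
print(code_counts.most_common())
print(notes_mentioning(visits, "confusion"))
```

Tabulating the codes takes one line, but nothing in the coded fields records the confusion or low mood described in the notes. A naive keyword search finds them only if the query happens to use the same wording as the note, which is the gap that NLP methods aim to close.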

Because computers were previously quite expensive, the focus of their use was to maximize financial reimbursement of the patient-provider encounter.4 As a result, the structured portion of the EHR still disproportionately captures diagnoses and health ...
