The United States Immunodeficiency Network (USIDNET) is a registry that collects de-identified patient data from hospitals across the country to study rare immune conditions, including inborn errors of immunity (IEI). The registry contains both structured data and clinical notes, such as pathology reports and imaging narratives. These text fields may include protected health information (PHI), which must be removed before the data can be added to the registry. To address this, we developed a tool called “PHIdentifier.” Unlike manual de-identification, which is time-consuming and error-prone, PHIdentifier is designed specifically for clinical notes and removes PHI from the text automatically. Valuable clinical details thus remain available for research, while patient privacy is protected and better patient care is supported.

PHIdentifier runs in a secure high-performance computing (HPC) environment so that it can process large volumes of text efficiently. It uses the Qwen-2.5-7B-Instruct large language model (LLM), combined with rule-based checks, to handle complex text patterns and de-identify different types of note fields consistently. The workflow starts with standard text preprocessing, which organizes and cleans the notes so that they are structured and ready for analysis. The tool then performs a multilayered de-identification process, using carefully crafted prompts that instruct the model to detect PHI in the text. The model’s responses are combined with rule-based checks so that only sensitive information is replaced with placeholders and all other clinical content is preserved. Additional quality checks verify accuracy and consistency across notes, yielding a reliable process that converts unstructured text into fully de-identified information.
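The placeholder-replacement step described above can be illustrated with a minimal sketch. It is not the PHIdentifier implementation: the rule patterns, the placeholder labels, and the `redact` function are hypothetical, and the LLM output is assumed to have already been parsed into (text, label) pairs. The sketch keeps an LLM-proposed entity only if it literally occurs in the note, adds spans found by simple regular-expression rules, and replaces each span with a bracketed placeholder while leaving the rest of the note untouched.

```python
import re

# Hypothetical rule-based patterns; a real system would need many more.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_spans(note, llm_entities):
    """Collect (start, end, label) spans from LLM proposals and regex rules."""
    spans = []
    for text, label in llm_entities:
        if not text:
            continue  # ignore empty proposals
        # Keep a proposal only where it literally appears in the note,
        # which discards entities the model may have invented.
        for match in re.finditer(re.escape(text), note):
            spans.append((match.start(), match.end(), label))
    for label, pattern in RULES.items():
        for match in pattern.finditer(note):
            spans.append((match.start(), match.end(), label))
    return spans


def redact(note, llm_entities):
    """Replace every detected span with a [LABEL] placeholder."""
    # Sort by start position; for ties, the longer span comes first.
    spans = sorted(find_spans(note, llm_entities),
                   key=lambda s: (s[0], -(s[1] - s[0])))
    pieces = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:
            # Overlaps an earlier span: fold it into that redaction.
            cursor = max(cursor, end)
            continue
        pieces.append(note[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(note[cursor:])
    return "".join(pieces)
```

For an invented note such as "Seen at Mercy Hospital on 03/14/2021; biopsy shows granulomas. Call 555-123-4567." with the assumed LLM proposals `[("Mercy Hospital", "HOSPITAL"), ("Dr. Nobody", "NAME")]`, `redact` returns "Seen at [HOSPITAL] on [DATE]; biopsy shows granulomas. Call [PHONE]." The second proposal is dropped because it does not occur in the note, and the clinical finding is preserved.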

PHIdentifier was tested on 3,000 narrative and pathology notes, achieving a precision of approximately 97.9–98.1% and a recall of 95.7–97.1%. Nearly all flagged items were true PHI, with only a moderate number of nonsensitive elements over-redacted. Very few direct identifiers, such as hospital names, locations, or years, were missed. This strong performance enables the tool to improve the overall quality and completeness of the registry dataset for rare immunodeficiency disorders. These results show how an LLM-based de-identification tool can make data collection more efficient and help protect patient privacy.
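For readers less familiar with these metrics: precision is the share of flagged items that are true PHI, and recall is the share of true PHI items that were flagged. The sketch below shows one common way to compute both from exact-match character spans. The function name and the toy spans are illustrative assumptions, not the evaluation code or data behind the figures reported above.

```python
def precision_recall(gold_spans, predicted_spans):
    """Compute precision and recall over exact-match (start, end) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)       # PHI correctly flagged
    false_positives = len(predicted - gold)      # non-PHI over-redacted
    false_negatives = len(gold - predicted)      # PHI that was missed
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Toy example with invented spans: three of four flagged spans are correct,
# and three of four true PHI spans are found.
gold = [(0, 5), (10, 15), (20, 25), (30, 35)]
predicted = [(0, 5), (10, 15), (20, 25), (40, 45)]
print(precision_recall(gold, predicted))
```

Exact span matching is the strictest choice; an evaluation could instead credit partial overlaps, which would change the counts.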

This abstract is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by-nc-nd/4.0/).
