De-Identification Profile
Projects that need de-identification should follow these steps to define the de-identification process that is appropriate for the project’s intended uses of the de-identified data. This process should be adapted to specific regulatory and legal contexts, such as HIPAA, GDPR, or PIPL.
The dataset's context refers to the environment in which the data is stored and transferred. To understand the complete situation, it's essential to analyze the purpose of data collection, the data recipients, and the data flow.
The intended uses of the data determine the extent of de-identification and the acceptable level of risk. The purpose must be clearly and formally documented. This documentation should justify each data element that is needed, which in turn determines what data is preserved (pass-through), what data is removed (redacted), and what data is transformed (e.g., generalized or perturbed).
Data recipients are the individuals, groups, or organizations who will use the de-identified data. It is crucial to identify and document all data recipients to understand the data sharing context and associated risks. This analysis should include:
Describe the end-to-end data flow, from original source to final recipients. A clear data flow diagram helps identify risks at each stage. Key components to analyze include:
A thorough assessment of the data itself is a prerequisite for designing an effective de-identification strategy.
The scope of data collection should be limited to the minimum necessary to achieve the defined purpose. Certain data types are inherently more challenging and time-consuming to de-identify and should be given special attention:
Describe the characteristics of the data subjects to understand the population and assess re-identification risk. Key characteristics include:
Also, determine if the data subjects belong to a vulnerable population or if the data is subject to special sensitivity rules (e.g., behavioral health data).
Identify all distinct types of data being collected (e.g., structured records, images, free-text files). It is critical to be exhaustive and capture auxiliary information where identifiers might be hidden, such as:
For each data attribute, perform a classification to understand its potential for re-identification. This is a critical step that informs the entire risk assessment and mitigation design. Attributes are typically classified as direct identifiers (e.g., name, medical record number), indirect (quasi-) identifiers (e.g., date of birth, ZIP code, sex), or non-identifying attributes.
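As a minimal illustration of recording this classification (the attribute names are hypothetical; a real project would derive them from its own data dictionary), a machine-readable map can later drive the element-by-element design:

```python
# Hypothetical attribute classification for a tabular dataset.
# The categories follow the classification described above.
ATTRIBUTE_CLASSES = {
    "patient_name":      "direct",           # direct identifier: remove or pseudonymize
    "medical_record_no": "direct",           # direct identifier
    "date_of_birth":     "quasi",            # quasi-identifier: generalize or perturb
    "zip_code":          "quasi",            # quasi-identifier
    "sex":               "quasi",            # quasi-identifier
    "lab_result_value":  "non-identifying",  # candidate for pass-through, subject to review
}
```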
The properties of the dataset as a whole can influence re-identification risk:
This involves thinking like an adversary to understand the threats that the de-identification process must protect against.
The way data is shared determines the level of control and the potential for attack. Common models include:
Based on the adversary's goals, attacks can be categorized as:
Formal privacy models provide a mathematical framework for measuring and controlling privacy risk. The choice of model is a foundational design decision. Two common models are k-anonymity and differential privacy, both of which are discussed in the risk assessment sections below.
Goals should balance privacy protection with data utility.
Translate the general goals into concrete, measurable targets for the specific project. To achieve this, a requirements document should be created, addressing the following questions:
Define Acceptable Risk Thresholds: Set a maximum acceptable re-identification risk level. This threshold is a quantitative measure that defines the boundary between Irreversibly Pseudonymized Data and Anonymous Data. There is no universal threshold; the appropriate value must be determined by analyzing two key factors: the data sharing model and the potential impact of a privacy invasion.
Factor 1: Data Sharing Model: How the data will be shared (e.g., a public release versus a controlled, non-public release) determines the likelihood of a re-identification attempt; the higher that likelihood, the more conservative the threshold must be.
Factor 2: Potential Impact of Re-identification: The potential harm that could result from re-identification is the second key factor. A dataset containing highly sensitive information requires a more conservative threshold than one with low sensitivity. To assess the potential impact, consider factors such as the sensitivity of the information, the scope and level of detail, the potential for tangible harm (e.g., financial loss, discrimination, reputational damage), and whether individuals consented to this specific secondary use of their data.
| Scenario | Possibility of Attack & Potential Impact | Risk Threshold |
|---|---|---|
| Public | High possibility of attack, low impact | Max 0.1 |
| Public | High possibility of attack, medium impact | Max 0.075 |
| Public | High possibility of attack, high impact | Max 0.05 |
| Non-public | Low-medium possibility of attack, low-medium impact | Avg 0.1 |
| Non-public | Medium possibility of attack, medium impact | Avg 0.075 |
| Non-public | Medium-high possibility of attack, medium-high impact | Avg 0.05 |
These benchmarks provide a structured framework for decision-making. For specific, high-stakes scenarios like the public release of clinical trial data, regulatory bodies have set their own standards. For example, the European Medicines Agency (EMA) and Health Canada have established a risk threshold of 0.09. This aligns with the principles for a public, high-impact release, demonstrating a real-world application of a conservative risk threshold to protect patient privacy.
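The benchmark table above can be captured as a simple lookup. The sketch below is illustrative only: the scenario and impact labels are assumptions of this example, and a real project must take its threshold from the documented risk analysis.

```python
# Benchmark thresholds from the table above. "max" bounds the maximum
# record-level risk; "avg" bounds the average risk across the dataset.
THRESHOLDS = {
    ("public", "low"):        ("max", 0.1),
    ("public", "medium"):     ("max", 0.075),
    ("public", "high"):       ("max", 0.05),
    ("non-public", "low"):    ("avg", 0.1),
    ("non-public", "medium"): ("avg", 0.075),
    ("non-public", "high"):   ("avg", 0.05),
}

def select_threshold(sharing_model: str, impact: str) -> tuple[str, float]:
    """Return (metric, threshold) for a sharing model and impact level."""
    return THRESHOLDS[(sharing_model, impact)]

# Example: public release of highly sensitive data -> ("max", 0.05).
metric, threshold = select_threshold("public", "high")
```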
Risk assessment can be both qualitative and quantitative.
A qualitative evaluation provides an initial classification of the dataset's identifiability level. This assessment is based on the types of identifiers present in the data after initial transformations and determines if a full quantitative evaluation is necessary. The classification aligns with the levels of identifiability defined in this handbook.
When a qualitative evaluation determines that a dataset contains indirect (quasi) identifiers, a quantitative evaluation is required to determine its final classification. This process uses objective, statistical methods to calculate a precise overall re-identification risk score. This score is then compared against the project's acceptable risk threshold to determine if the data can be considered Anonymous or must be treated as Irreversibly Pseudonymized.
Following the standard risk model described in (ISO/IEC 27559, 2022), identifiability can be conceptualized as the product of the probability of identification given a specific threat and the probability of that threat being realized. That is:
[ P(\text{identification}) = P(\text{identification} \mid \text{threat}) \times P(\text{threat}) ]
This model provides a valuable framework for understanding the two key components of re-identification risk:
Data Risk: The risk inherent in the data itself, corresponding to (P(\text{identification} \mid \text{threat})).
Context Risk: The risk that a threat is realized in the data sharing environment, corresponding to (P(\text{threat})).
While this formula provides the conceptual basis, its practical application differs significantly between the primary privacy models:
The following sections detail how to assess these risks for each model.
Data risk ((R_d)) is the probability of re-identification based on the properties of the dataset itself. The specific method for calculating it depends directly on the formal privacy model being used.
For k-Anonymity and related models:
When using k-anonymity, data risk is calculated by analyzing the size of the "equivalence classes" (groups of records with identical quasi-identifiers). Common metrics include:
Maximum probability of re-identification ((R_{d,b})): The risk for the most identifiable equivalence class, (R_{d,b} = \max_{j \in J}\theta_{j}), where (\theta_{j}) is the re-identification probability of equivalence class (j) (the reciprocal of its size). This conservative metric is typically used for public releases.
Average probability of re-identification ((R_{d,c})): The average risk across all equivalence classes, (R_{d,c} = \frac{1}{|J|}\sum_{j \in J}\theta_{j}). This may be appropriate for more controlled sharing models.
The foundational methods for these calculations are detailed in (Sweeney, 2002). The selected metric ((R_{d,b}) or (R_{d,c})) becomes the value for (R_d) used in the overall risk calculation.
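As a minimal sketch of these calculations (the record fields are hypothetical; production projects should use validated tooling), both metrics follow directly from grouping records on their quasi-identifiers:

```python
from collections import Counter

def data_risk(records: list[dict], quasi_identifiers: list[str]) -> tuple[float, float]:
    """Compute (R_d_b, R_d_c) for a dataset under the k-anonymity model.

    An equivalence class is a group of records sharing identical values
    for every quasi-identifier; theta_j = 1/f_j is its re-identification
    probability, where f_j is the class size.
    """
    class_sizes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    thetas = [1.0 / size for size in class_sizes.values()]
    r_d_b = max(thetas)                # maximum risk across classes
    r_d_c = sum(thetas) / len(thetas)  # average risk across classes
    return r_d_b, r_d_c

# Hypothetical example: one unique record dominates the maximum risk.
records = [
    {"zip": "53705", "yob": 1980, "sex": "F"},
    {"zip": "53705", "yob": 1980, "sex": "F"},
    {"zip": "53706", "yob": 1975, "sex": "M"},
]
r_d_b, r_d_c = data_risk(records, ["zip", "yob", "sex"])  # (1.0, 0.75)
```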
For Differential Privacy:
Differential Privacy (DP) takes a different approach. Instead of calculating a post-hoc re-identification risk, DP provides a proactive mathematical guarantee of privacy, quantified by the privacy loss parameter, epsilon ((\epsilon)).
While methods exist to relate (\epsilon) to a probabilistic re-identification risk, they are complex and model-dependent. For the purposes of this handbook, the primary method for managing data risk under DP is the selection and enforcement of the privacy budget ((\epsilon)).
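For illustration, a minimal sketch of the Laplace mechanism, the elementary building block behind many (\epsilon)-differentially private releases (the query and the budget value are assumptions of this example, not recommendations):

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one individual's record changes a count by at most
    `sensitivity` (= 1), so Laplace noise with scale sensitivity/epsilon
    yields the epsilon-DP guarantee. Smaller epsilon means more noise and
    a stronger privacy guarantee.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: spend epsilon = 0.5 of the total privacy budget on one query.
noisy_count = dp_count(true_count=42, epsilon=0.5)
```

Each released query consumes part of the budget; under basic composition the (\epsilon) values of all queries add up, which is why enforcing the total budget is the central control.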
Context risk ((R_c)) assesses the likelihood of a re-identification attempt based on the data sharing environment. Its application differs depending on the chosen privacy model.
For k-Anonymity and related models:
The goal is to calculate a specific probability for (R_c), which is then used in the overall risk formula ((R = R_d \times R_c)). This probability is estimated as the maximum of three component threats:
[ R_c = \max(T1, T2, T3) ]
Where:
- (T1) is the probability of a deliberate re-identification attempt by the data recipient (e.g., a rogue insider).
- (T2) is the probability of an inadvertent re-identification, such as a data user spontaneously recognizing an acquaintance in the data.
- (T3) is the probability of a data breach that exposes the dataset to an outside attacker.
For public data releases, (R_c) is typically assumed to be 1 (100%), reflecting the high likelihood of an attack attempt. For controlled models, these probabilities are estimated based on the security controls, contractual obligations, and the nature of the data recipients. For detailed methodologies on estimating these probabilities, see guidance from sources like (IPC_ONTARIO, 2016) and (El Emam, 2013).
For Differential Privacy:
Context risk is not used to calculate a final number for multiplication. Instead, the assessment of the context directly informs the selection of the privacy budget ((\epsilon)). The same threat components (T1, T2, T3) are evaluated to justify the choice of (\epsilon).
The output of the context risk assessment in a DP model is a documented rationale for the chosen (\epsilon) value, linking it directly to the environmental and sharing risks.
The final step is to determine the overall re-identification risk by combining the data risk and context risk, though the method differs by privacy model.
For k-Anonymity and related models:
The overall risk is the product of the data risk (the chosen metric, e.g., (R_{d,b})) and the context risk ((R_c)). This final value is then compared against the project's acceptable risk threshold.
[ R = R_d \times R_c ]
For example, if the maximum data risk ((R_{d,b})) is 0.1 and the context risk for a controlled sharing environment ((R_c)) is estimated at 0.5, the overall risk would be (0.1 \times 0.5 = 0.05). This value would need to be below the project's defined threshold.
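Putting these pieces together, a minimal sketch of the overall check (all probability values below are illustrative):

```python
def overall_risk(r_d: float, t1: float, t2: float, t3: float,
                 public_release: bool = False) -> float:
    """Combine data and context risk: R = R_d * R_c.

    For public releases R_c is assumed to be 1.0; otherwise it is the
    maximum of the three threat probabilities T1, T2, T3.
    """
    r_c = 1.0 if public_release else max(t1, t2, t3)
    return r_d * r_c

# Worked example from the text: R_d,b = 0.1 and R_c = 0.5 give R = 0.05,
# which is then compared against the project's documented threshold.
risk = overall_risk(r_d=0.1, t1=0.5, t2=0.2, t3=0.3)
assert risk <= 0.075  # illustrative threshold only
```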
For Differential Privacy:
With Differential Privacy, the overall risk is not a calculated product. Instead, the context risk ((R_c)) directly influences the selection of the privacy budget ((\epsilon)).
The project must document the rationale for how the data sharing context and potential harms informed the choice of (\epsilon). The "pass/fail" criterion is whether the implemented system can enforce this chosen (\epsilon) for all data queries.
This phase involves designing and applying controls to reduce the identified risks to an acceptable level.
De-identification is often not a single action but a multi-stage process, especially in complex environments like healthcare. A multi-stage design is necessary to accommodate the practical limitations of data sources and to apply the appropriate level of expertise and technology at each step. While a two-stage process is a common example, a true multi-stage architecture can involve several specialized steps.
A multi-stage approach is often essential for several practical and technical reasons:
This flexible approach allows an organization to create a de-identification pipeline. The process can start at the source and then hand off the partially-processed data to subsequent stages, each designed to handle a specific challenge—such as text, imaging, or structured data—before a final risk assessment is performed.
The multi-stage de-identification process can be visualized as a series of state transitions, where data moves from a higher level of identifiability to a lower one. Each transition is achieved by applying specific de-identification techniques, as defined in the Concepts chapter.
A three-stage model provides a comprehensive example of this workflow:
```mermaid
graph TD
subgraph "Stage 1: Preliminary De-Identification"
direction LR
A[Identified Data] -->|Reversible Pseudonymization| B(Readily-Identifiable Data);
end
subgraph "Stage 2: Advanced De-Identification"
direction LR
B -->|Irreversible Pseudonymization<br/>Anonymization w/ Non-Negligible Risk| C(Irreversibly Pseudonymized Data);
C -->|Anonymization w/ Negligible Risk| D(Anonymous Data);
end
subgraph "Stage 3: Recipient Verification"
direction LR
D --> E(Recipient Risk Verification);
C --> E;
end
style A fill:#f9f,stroke:#333,stroke-width:2px
style B fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#ccf,stroke:#333,stroke-width:2px
style D fill:#9c9,stroke:#333,stroke-width:2px
style E fill:#f5f5f5,stroke:#333,stroke-width:2px
```
This diagram illustrates a typical workflow: Stage 1 uses reversible pseudonymization to turn Identified Data into Readily-Identifiable Data; Stage 2 applies irreversible pseudonymization or anonymization to move the data into the Irreversibly Pseudonymized or fully Anonymous states; and Stage 3 verifies that the residual risk is acceptable for the intended recipients.

Beyond this three-stage example, a more granular, multi-stage process might dedicate separate sub-stages within the Advanced phase to handling specific data types, such as a dedicated NLP pipeline for free text or an image processing stage for pixel data, before a final, holistic risk assessment is performed.
When designing a multi-stage process, the following factors must be considered:
The design must demonstrate how the chosen techniques will make the data resistant to re-identification. This is often measured by the expected percentage of participants who could be re-identified using established methods, and is a primary goal of later-stage analysis.
Element-by-Element De-identification Design
One key element of the design is a listing of how each possible data element in the input data set will be processed. It is not possible to create a single universally appropriate table. Examples like these can act as a starting point for purpose-specific designs.
The DICOM standard provides initial starting point tables for commonplace de-identification requirements for imaging results in the DICOM format. These are in PS 3.15 Annex E, especially Table E.1-1. The DICOM standard is freely available for use, and permission is granted for public and private use of extracts. It is provided in Word format to simplify such use. Note that the DICOM standard identifies private attributes that are claimed to lack personal information; other treatment of private attributes must be part of the design process.
There are also project and other examples available, such as the Biosurveillance Use Case Minimum Data Elements Specification, that can serve as a reference.
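As an illustration of such an element-by-element listing (the element names, actions, and helper below are hypothetical, not drawn from the DICOM or biosurveillance tables), each input element can be mapped to a disposition that the pipeline then applies:

```python
def generalize_dob(value: str) -> str:
    """Example transformation: keep only the year of birth (YYYY-MM-DD input)."""
    return value[:4]

# Hypothetical element-by-element design: keep (pass-through),
# redact (remove), or transform (generalize/perturb).
ELEMENT_ACTIONS = {
    "patient_name":  ("redact", None),
    "record_number": ("redact", None),
    "date_of_birth": ("transform", generalize_dob),
    "diagnosis":     ("keep", None),
}

def apply_design(record: dict) -> dict:
    """Apply the documented disposition to every element of a record."""
    out = {}
    for name, value in record.items():
        # Elements without a documented action default to redaction, so
        # unexpected inputs never pass through unreviewed.
        action, transform = ELEMENT_ACTIONS.get(name, ("redact", None))
        if action == "keep":
            out[name] = value
        elif action == "transform":
            out[name] = transform(value)
    return out
```

Defaulting unknown elements to redaction reflects the principle above that every preserved element must be justified by the documented purpose.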
This refers to identifiers generated during the de-identification process itself, such as a batch ID or a log of transformations. These must also be managed securely.
Technical de-identification must be supported by strong operational and security policies.
Implement the principle of least privilege, ensuring that personnel only have access to the data necessary to perform their roles. This should be governed by the organization's Human Resource Security Policy.
Secrets used in the de-identification process must be managed securely. This includes:
Use a secure vault or password manager and strictly limit access to these secrets.
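For example, when pseudonymization is implemented with a keyed hash, that key is precisely the kind of secret these controls protect. A minimal sketch, assuming the key is retrieved from the environment for illustration only (in practice it would come from the vault):

```python
import hashlib
import hmac
import os

# The pseudonymization key is a secret: keep it in a secure vault, never
# in source code or in configuration committed to a repository.
key = os.environ["PSEUDONYM_KEY"].encode()  # illustrative retrieval only

def pseudonymize(identifier: str) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256.

    Anyone holding the key can regenerate and link pseudonyms, so a key
    compromise is a compromise of the pseudonymization itself.
    """
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()
```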
Data must be protected in transit. Use secure, approved methods for transferring data, such as:
Data must be protected at rest. Use strong, industry-standard encryption algorithms like AES-256 to encrypt datasets before storage or transfer.
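For illustration, a minimal sketch of AES-256 encryption at rest using the Python `cryptography` package (key storage and rotation are assumed to follow the secrets guidance above):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_dataset(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt data with AES-256-GCM (authenticated encryption).

    `key` must be 32 bytes (256 bits). A fresh random 96-bit nonce is
    generated per message and stored alongside the ciphertext.
    """
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_dataset(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)  # store via the vault, not on disk
sealed = encrypt_dataset(b"de-identified dataset bytes", key)
```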
Insecure data disposal can lead to breaches. Follow a formal data disposal policy, using tools that perform a secure, multi-pass wipe (e.g., a DoD-style wipe) to permanently erase data from media before they are decommissioned.
IHE has profiles, such as the imaging teaching files profile, for some of the common de-identification situations. DICOM has identified some common intended use requirements and defined de-identification profiles for these situations. Other organizations have published their de-identification profiles. When developing project-specific de-identification profiles, these can be a good starting point.
This document does not cover the implementation of processes, procedures, software, or staffing of the de-identification system. There are already established methodologies for project management, safety risk analysis, etc. These are also applicable to the deployment of de-identification processes, and there is usually no need to invent new, unfamiliar processes for the organization.
A strong governance framework is essential for ensuring that the de-identification process is effective, compliant, and maintained over time.
The de-identification program should be founded on principles of accountability, fairness, and transparency, in alignment with the organization's overall data governance and privacy policies.
Clearly define roles and responsibilities to ensure accountability. Key roles often include:
All personnel involved must receive appropriate training on data handling and their specific responsibilities.
The design should be validated before it is fully implemented. Validation should focus on confirming that identified project risks are either reduced or identified as protection requirements. This must include risks related to overall project objectives, risks to individual patient privacy, and risks of non-compliance with applicable policy. The validation review may need to include stakeholders such as an Institutional Review Board (IRB).
The design validation phase consists of three steps:
A best practice is to identify a small subset of the project data (or a very similar dataset) for this review. You can write test scripts or manually process this data using the anticipated algorithms and then provide the processed data to the end-user to confirm it meets their expectations. As should be apparent, the more data that can be removed, the easier the validation phase and subsequent steps become, and the lower the risks.
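For example, a minimal sketch of such a test script (the record fields and the stand-in algorithm are hypothetical), asserting that direct identifiers are absent and required elements are preserved before the output goes to the end-user for review:

```python
DIRECT_IDENTIFIERS = {"patient_name", "record_number", "phone"}

def deidentify(record: dict) -> dict:
    """Stand-in for the project's anticipated de-identification algorithm."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

def validate(processed_records: list[dict]) -> None:
    for record in processed_records:
        # No direct identifier may survive processing.
        leaked = DIRECT_IDENTIFIERS & record.keys()
        assert not leaked, f"direct identifiers present: {leaked}"
        # Elements required by the documented purpose must be preserved.
        assert "diagnosis" in record, "required element missing"

# Small review subset of (synthetic or very similar) project data.
review_subset = [
    {"patient_name": "Jane Doe", "record_number": "42", "diagnosis": "J45"},
]
validate([deidentify(r) for r in review_subset])
```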
After the process is implemented, it must be validated with real data in the operational environment. This validation is part of any healthcare system deployment and existing processes should apply.
De-identification adds one extra element: an operational validation using a subset of real data should be performed. This is very similar to the initial design validation but uses the operational system processes, staff, and software.
De-identification is not a one-time activity. The environment is a continuously moving target, as re-identification algorithms improve and new public datasets become available that could be linked. Computer and storage capabilities increase rapidly; for example, genetic data that was once too costly to use for identification is now usable because the cost of storage and computing has plummeted. Previously private information is increasingly available for sale, and you can now track a person’s location from cell phone records to match a person’s ID to healthcare provider visits.
One famous example is the analysis by Latanya Sweeney, then at CMU, who showed that just three data points (date of birth, current ZIP code, and sex) were enough to uniquely identify an estimated 87% of the U.S. population. An example of the impact of changing technology is the evaluation of risks from the unrestricted publication of raw genetic data by Erlich and Narayanan.
Therefore, the process must be continuously monitored and periodically audited. A re-evaluation of risk should be triggered by any significant change in the data context or on a regular schedule.
Maintain thorough documentation of the entire process for accountability, auditing, and compliance. Essential records include:
Follow the organization's established security incident management policy to respond to any suspected or actual data breach or privacy incident.