De-Identification Profile
ISO 20889 classifies de-identification techniques into the following categories (ISO20889):
Identifiers, both direct and indirect, can be used to re-identify individuals directly or indirectly. To reduce the re-identification risk, these identifiers need to be transformed in a way that makes it unlikely they can be used to link the de-identified records back to the original personal identifiers. Various techniques can be used to transform these identifiers.
The purpose of transforming direct identifiers is to hide the original personal direct identifiers in the pseudonymized data without damaging data consistency, that is, the linking relationships between records. Usually, records are linked together via personal direct identifiers. There may be multiple direct identifiers in the dataset that need to be processed, for example patient ID, patient name, mobile phone number, and email address. However, not all of them are required to ensure the consistency of the transformed records. Thus, it is important to correctly select the direct identifiers to be replaced with pseudonyms. Direct identifiers that are not selected for pseudonymization should be removed or masked completely. For example, when the patient ID is the only direct identifier selected for pseudonymization, then the patient name, mobile phone number, and email address should be removed or masked in a secure way. Typical masking techniques include (NIST_SP_800-188_2023):
For direct identifiers that maintain the relationships between records and data subjects, pseudonymization techniques should be used: the original personal identifiers are replaced with pseudonyms. A pseudonym is a special kind of personal identifier that is different from the normally used personal identifier and is used with pseudonymized data to provide dataset coherence, linking all the information about a subject without disclosing the real-world identity of the person (ISO20889).
Various techniques can be used to create pseudonyms, and they can be classified into two groups: A) pseudonyms independent of identifying attributes; B) pseudonyms derived from identifying attributes using cryptography. The decision on which method to use hinges on considerations such as the cost of producing pseudonyms, the hash function's ability to avoid collisions (that is, the chance of two distinct inputs yielding identical outputs), and the approach for re-establishing the identity of the data subject in a managed re-identification process (mayer2011implementation).
When multiple organizations use the same pseudonymization scheme, they can trade data and perform matching on the pseudonyms. This approach is sometimes called privacy preserving record linkage (PPRL). Some PPRL approaches perform record linkage within a secure data enclave to minimize the risk of unauthorized re-identification. As an alternative, organizations can participate in a private set intersection protocol.
The pseudonym values are created in a way that is independent of the original personal identifiers. Usually, random values are used to create unique pseudonyms. In order to link the pseudonyms back to the original personal identifiers, a table (usually called a linking table) is used to maintain the relationship between the independently created pseudonyms and the original personal identifiers. Secure material, such as the linking table, salt values, and data encryption keys, should not be released together with the pseudonymized data, and appropriate technical and organizational measures should be established to protect it from unauthorized access. The secure material should be managed by a trusted party in a way that prevents data recipients from accessing it. Only when authorized re-identification is required can the authorized party or individual access the secure material to perform re-identification. The trusted party is not necessarily a third party, but in some cases it can be. If the trusted party is a third party, data escrow can be leveraged to ensure the material is accessible only under specific conditions.
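As an illustration, the following minimal sketch (Python, standard library only) generates attribute-independent random pseudonyms and records them in an in-memory dictionary standing in for the protected linking table; the identifier format and the 64-bit pseudonym length are illustrative assumptions, and in practice the linking table would be held by the trusted party and never released with the data.

```python
import secrets

# In-memory stand-ins for the protected linking table; in practice these are
# held by the trusted party and never released with the pseudonymized data.
linking_table = {}   # pseudonym -> original identifier
assigned = {}        # original identifier -> pseudonym (one pseudonym per subject)

def pseudonymize(patient_id: str) -> str:
    """Return a random pseudonym for patient_id, creating and recording one if needed."""
    if patient_id in assigned:
        return assigned[patient_id]
    pseudonym = secrets.token_hex(8)  # random value, independent of the identifier
    assigned[patient_id] = pseudonym
    linking_table[pseudonym] = patient_id
    return pseudonym

def reidentify(pseudonym: str) -> str:
    """Authorized re-identification via the linking table (trusted party only)."""
    return linking_table[pseudonym]

print(pseudonymize("MRN-000123"))               # e.g. 'a3f1c2...'
print(pseudonymize("MRN-000123"))               # same pseudonym, so record linkage is preserved
print(reidentify(pseudonymize("MRN-000123")))   # 'MRN-000123'
```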
Pseudonyms can be cryptographically derived from the values of the attributes that they replace through encryption or hashing.
Encryption: Encrypt the identifiers with a strong encryption algorithm. After encryption, the key can be discarded to prevent decryption. However, if there is a desire to employ the same transformation at a later point in time, the key should be stored in a secure location that is separate from the de-identified dataset. Encryption used for this purpose carries special risks that need to be addressed with specific controls. Canonicalization is essential when generating pseudonyms via encryption because it ensures that minor variations in input data for the same real-world identity always produce the exact same pseudonym. Without it, an individual could inadvertently be assigned multiple pseudonyms, undermining data consistency and referential integrity. By standardizing the input (e.g., converting emails to lowercase), canonicalization guarantees a consistent, deterministic mapping, crucial for effective and secure pseudonymization.
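As a minimal sketch (not a prescribed implementation), the example below assumes the pyca/cryptography package and uses AES-SIV, a deterministic authenticated encryption mode, so that the same canonicalized input always yields the same pseudonym; the canonicalization rule (trimming and lowercasing email addresses) is an illustrative assumption.

```python
# Sketch only: assumes the pyca/cryptography package (pip install cryptography).
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

def canonicalize(email: str) -> bytes:
    """Illustrative canonicalization: trim whitespace and lowercase."""
    return email.strip().lower().encode("utf-8")

# Keep the key in a key store separate from the de-identified dataset.
key = AESSIV.generate_key(512)
cipher = AESSIV(key)

def encrypted_pseudonym(email: str) -> str:
    # AES-SIV is deterministic: identical canonical input yields an identical pseudonym.
    return cipher.encrypt(canonicalize(email), None).hex()

assert encrypted_pseudonym("Alice@Example.org ") == encrypted_pseudonym("alice@example.org")
```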
Hashing with a keyed hash: A keyed hash is a special kind of hash function that produces different hash values for different keys. The hash key should have sufficient randomness to defeat a brute force attack aimed at recovering the hash key (e.g., SHA-256 HMAC with a 256-bit randomly generated key). As with encryption, the key should be secret and should be discarded unless there is a desire for repeatability. Hashing used for this purpose carries special risks that need to be addressed with specific controls. Hashing without a key generally does not confer security because an attacker can brute force all possible values to be hashed.
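A minimal sketch of a keyed-hash pseudonym using only the Python standard library follows; the 256-bit random key and the canonicalization step are illustrative assumptions, and the key must be stored securely apart from the de-identified data (or destroyed if repeatability is not needed).

```python
import hashlib
import hmac
import secrets

# 256-bit random hash key; protect it separately or discard it when repeatability is not needed.
hash_key = secrets.token_bytes(32)

def keyed_hash_pseudonym(identifier: str) -> str:
    """HMAC-SHA-256 of the canonicalized identifier under the secret key."""
    canonical = identifier.strip().lower().encode("utf-8")
    return hmac.new(hash_key, canonical, hashlib.sha256).hexdigest()

print(keyed_hash_pseudonym("Alice@Example.org"))
print(keyed_hash_pseudonym(" alice@example.org "))  # same pseudonym after canonicalization
```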
A brute-force attack on pseudonyms involves an attacker systematically guessing original identity data (e.g., names, emails) and applying the pseudonym generation logic (hashing or encryption) until a match with a target pseudonym is found. This can reveal the original sensitive information, especially if the original data is predictable or if crucial defenses like salting are not employed. Strong algorithms and proper salting are vital to make such attacks computationally infeasible.
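To make that risk concrete, the sketch below (standard library only, with a hypothetical 4-digit identifier) shows how an unkeyed, unsalted hash of a low-entropy value can be reversed simply by enumerating the input space; the same enumeration works against predictable names or email addresses using a dictionary, which is why keyed hashing or proper salting is essential.

```python
import hashlib

# Unkeyed, unsalted hash of a low-entropy identifier (hypothetical 4-digit value).
target_pseudonym = hashlib.sha256(b"0042").hexdigest()

# The attacker enumerates every possible input and hashes it until a match is found.
for candidate in range(10_000):
    guess = f"{candidate:04d}".encode()
    if hashlib.sha256(guess).hexdigest() == target_pseudonym:
        print("Recovered original value:", guess.decode())
        break
```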
In statistics, a categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property (source: Wikipedia). In the context of de-identification, categorical quasi-identifiers are categorical variables. In healthcare, patient gender and diagnosis code are typical examples of categorical quasi-identifiers.
Common techniques for de-identifying categorical variables include:
Generalization: This involves reducing the granularity of data by grouping specific values into broader categories. For example, replacing the ICD-10 diagnosis codes E11.2 (Type 2 diabetes mellitus with kidney complications) and E11.4 (Type 2 diabetes mellitus with neurological complications) with E11 (Type 2 diabetes mellitus). Applying generalization to categorical quasi-identifiers requires a hierarchy definition; for example, in the ICD-10 hierarchy, E11.2 and E11.4 fall under E11 (see the sketch after this list). Selecting the appropriate level of generalization needs to be carefully discussed and confirmed with the intended data receivers to make sure the reduced granularity of the new value is acceptable with respect to the purpose of the data collection.
Suppression: Removing or masking highly identifiable categories (e.g., omitting rare diagnosis codes).
Randomization: Randomization techniques like permutation can be used to exchange values of categorical variables between records to obscure individual identities while maintaining overall distributions.
Blanking and imputing: Specific values that are highly identifying can be removed and replaced with imputed values. Methods for generating imputed values range from simple to statistically complex, aiming to replace sensitive, missing, or identifying data points with plausible substitutes while preserving data utility and privacy. Basic techniques include replacing values with the mean, median, or mode of the observed data, or a fixed constant. More advanced methods leverage statistical models, such as regression imputation, which predicts values based on other variables, or K-Nearest Neighbors (KNN) imputation, which uses values from similar records. For robust analysis, multiple imputation is often preferred, generating several complete datasets with different plausible imputed values and combining the results to account for uncertainty.
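The sketch below illustrates generalization by truncating ICD-10-CM codes to their three-character category (e.g., E11.2 to E11); in practice the appropriate level would be taken from a full terminology hierarchy agreed with the data receivers, so the truncation rule here is only an illustrative stand-in.

```python
def generalize_icd10(code: str) -> str:
    """Generalize an ICD-10-CM code to its three-character category, e.g. E11.2 -> E11."""
    return code.split(".")[0][:3]

records = [
    {"diagnosis": "E11.2"},  # Type 2 diabetes mellitus with kidney complications
    {"diagnosis": "E11.4"},  # Type 2 diabetes mellitus with neurological complications
]
for record in records:
    record["diagnosis"] = generalize_icd10(record["diagnosis"])

print(records)  # both records now carry the broader category E11
```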
Compared with categorical variables, numerical variables may or may not have a limited or fixed number of possible values. For example, continuous numerical variables, including fractions or decimals, have an unlimited number of possible values. Although the number of possible values for a discrete or integer-valued variable such as patient age is fixed, the broad range of the value domain can easily produce outlier data points, for example, patients over 90 years old.
For numeric quasi-identifiers that have a limited or fixed number of possible values, all of the techniques for de-identifying categorical quasi-identifiers can be applied. Techniques that address the specific characteristics of numeric quasi-identifiers include:
Top and bottom coding: Outlier values above an upper threshold or below a lower threshold are replaced with a single top code or bottom code. For example, the Safe Harbor method under the HIPAA Privacy Rule calls for ages over 89 to be "aggregated into a single category of age 90 or older" (see the sketch following this list).
Micro aggregation: Micro-aggregation is a statistical disclosure control technique used to de-identify datasets containing continuous values by clustering records into small groups and replacing individual values with group-level averages. The process involves forming clusters of at least a minimum number of similar records (typically 3 or more) based on their continuous variable values, ensuring that records within each cluster are as homogeneous as possible. The original continuous values in each cluster are then replaced with their computed average, determined algorithmically. This approach aims to maximize within-cluster similarity to preserve data utility while reducing the risk of re-identification, balancing privacy protection with the retention of statistical properties for analysis (Domingo-Ferrer & Mateo-Sanz, 2002).
Generalize categories with small counts: Small counts of data subjects for a specific value of a numeric quasi-identifier increase the risk of re-identification. To mitigate this risk, categories with low counts can be generalized into a single category with a higher count. For example, if there is only one patient aged 80, 85, and 90, respectively, these individual age categories can be combined into a broader "80+" category. This technique can also be applied to contingency tables in which several categories have small counts. For example, rather than reporting that there are two people with blue eyes, one person with green eyes, and one person with hazel eyes, it may be reported that there are four people with blue, green, or hazel eyes (NIST_SP_800-188_2023).
Noise addition: Also called "noise infusion" and "noise injection" (and occasionally "partially synthetic data"), this approach adds small random values to attributes. For example, instead of reporting that a person is 84 years old, the person may be reported as being 79 years old (NIST_SP_800-188_2023).
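The sketch below combines top coding and noise addition for an age attribute; the threshold of 90 mirrors the HIPAA Safe Harbor example above, while the choice of uniform noise within plus or minus 5 years is purely illustrative.

```python
import random

def top_code_age(age: int, threshold: int = 90) -> str:
    """Collapse outlier ages into a single '90+' category (cf. HIPAA Safe Harbor)."""
    return "90+" if age >= threshold else str(age)

def add_noise(age: int, spread: int = 5) -> int:
    """Illustrative noise addition: perturb the age by a small random offset."""
    return max(0, age + random.randint(-spread, spread))

ages = [27, 42, 84, 95]
print([top_code_age(a) for a in ages])   # ['27', '42', '84', '90+']
print([add_noise(a) for a in ages])      # e.g. [29, 40, 81, 97]
```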
Dates are a common quasi-identifier and must be handled carefully to prevent re-identification. Common techniques include:
Geographic data is highly identifying. Techniques for de-identifying it include:
Genomic data is unique and inherently identifying. De-identification is extremely challenging and often focuses on risk mitigation rather than complete anonymization. Key strategies include:
Although advanced technical solutions for mitigating privacy risks in genetic data are still largely in the research and development phase, organizations currently prioritize privacy policies and data management practices that define privacy controls for genetic data (including, but not limited to, de-identification), while treating genetic data as not easily identifiable (Future of Privacy Forum, 2020).
Free-text fields like clinical notes or physician comments are a major challenge for de-identification because they often contain direct and indirect identifiers in unstructured formats. Automated de-identification of this content requires a combination of techniques.
A best practice is to use a hybrid approach. First, apply a set of highly accurate regular expressions to catch structured PII. Then, run an NLP model on the remaining text to identify more complex and context-dependent identifiers. This layered strategy maximizes the removal of sensitive information while minimizing the risk of altering non-sensitive clinical information.
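A minimal sketch of this layered approach follows: high-precision regular expressions remove well-structured identifiers first, and a placeholder function marks where a clinical NLP named-entity model would redact the remaining context-dependent identifiers; the patterns, labels, and sample note are illustrative assumptions.

```python
import re

# Layer 1: high-precision regular expressions for structured PII (illustrative patterns).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[-\s]?\d{6,}\b", re.IGNORECASE),
}

def scrub_structured_pii(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def scrub_with_nlp(text: str) -> str:
    # Layer 2 placeholder: a named-entity recognition model would tag and redact
    # names, dates, and locations that regular expressions cannot reliably capture.
    return text

note = "Pt John Smith (MRN 1234567) reachable at 555-867-5309 or john.smith@example.org."
print(scrub_with_nlp(scrub_structured_pii(note)))
```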
Sensitive attributes (e.g., medical diagnoses, income) are the information that an attacker is trying to learn. While not identifiers themselves, protecting them is critical because their disclosure is often the primary goal of an attack. A significant challenge arises when de-identification techniques like k-anonymity, which focus on preventing re-identification, fail to prevent attribute disclosure. An attacker may not need to pinpoint a single individual's record if they can determine with high probability that a target belongs to a group that overwhelmingly shares the same sensitive trait. This risk is magnified when the sensitive attribute has a very limited number of distinct values (e.g., "cancer" vs. "no cancer"). In such cases, even if an equivalence class is large (e.g., 20 members, giving a low 5% re-identification risk), if all members share the same diagnosis, the attacker learns the sensitive attribute with 100% certainty.
This vulnerability is exploited through homogeneity and background knowledge attacks. A homogeneity attack succeeds when the sensitive values within an equivalence class lack diversity, allowing for confident inference. For example, if an attacker knows their target is in a 10-person equivalence class where everyone has the same rare disease, the target’s sensitive information is compromised. A background knowledge attack is more subtle, where an attacker uses external information to infer the sensitive attribute. For instance, if an equivalence class contains 19 people with a common cold and one with HIV, an attacker might infer that any individual in that group is highly unlikely to have HIV, which still reveals information.
To counter these inference attacks, privacy models that go beyond k-anonymity are necessary. The principle of l-diversity was introduced to directly address the problem of homogeneity. It strengthens k-anonymity by requiring that every equivalence class contains at least l distinct or "well-represented" values for the sensitive attribute. This ensures that even if an attacker successfully identifies an individual's equivalence class, they cannot infer the sensitive attribute with a confidence greater than 1/l. For example, if an equivalence class satisfies 3-diversity for a "Diagnosis" attribute, it must contain at least three different diagnosis values, making it significantly harder for an attacker to pinpoint the correct one.
While l-diversity prevents exact disclosure by ensuring variety, it can still be vulnerable if the distribution of sensitive values is skewed or if the values themselves, while different, are semantically similar. The t-closeness principle offers a more robust solution by requiring that the distribution of the sensitive attribute within each equivalence class is close to its distribution in the overall dataset. The "distance" between the two distributions must be no more than a predefined threshold t. By maintaining this distributional similarity, t-closeness ensures that an attacker gains no significant new information by learning which equivalence class an individual belongs to, thereby protecting against a wider range of attribute inference attacks.
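To make the l-diversity requirement concrete, the short sketch below groups illustrative records (not the example table later in this section) by their quasi-identifiers and reports the number of distinct sensitive values in each equivalence class; under the simple "distinct l-diversity" reading, a class passes only if that count is at least l.

```python
from collections import defaultdict

# Illustrative records: (quasi-identifier tuple, sensitive attribute).
records = [
    (("25-29", "Female", "1002x"), "Hypertension"),
    (("25-29", "Female", "1002x"), "Asthma"),
    (("25-29", "Female", "1002x"), "Migraine"),
    (("30-34", "Male", "1003x"), "Diabetes"),
    (("30-34", "Male", "1003x"), "Diabetes"),  # homogeneous class: only one distinct value
]

classes = defaultdict(list)
for qi, sensitive in records:
    classes[qi].append(sensitive)

l_required = 2
for qi, values in classes.items():
    distinct = len(set(values))
    verdict = "satisfies" if distinct >= l_required else "violates"
    print(f"{qi}: {distinct} distinct sensitive value(s) -> {verdict} {l_required}-diversity")
```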
k-Anonymity is a privacy model that provides a clear, measurable standard for de-identification. Unlike techniques that focus on single attributes (like pseudonymization or generalization), k-anonymity assesses the re-identification risk from the perspective of the entire dataset. It ensures that any individual in a de-identified dataset cannot be distinguished from at least k-1 other individuals based on their quasi-identifiers. The value of k represents the size of the "anonymity set," and the re-identification risk for any individual in that set is quantified as 1/k.
This is achieved by generalizing or suppressing quasi-identifiers until each record's set of quasi-identifiers is identical to at least k-1 other records in the dataset. These groups of identical records are called "equivalence classes." For example, if k=5, for any combination of quasi-identifiers (e.g., age, gender, zip code), there must be at least 5 records with those same generalized values. This makes it harder for an attacker to re-identify individuals by linking the data with other sources.
The typical application of k-anonymity is for microdata. Microdata refers to datasets where each row corresponds to a specific individual or unit of observation (like a person, a household, or a single transaction), and the columns correspond to different variables or attributes of that unit. The example table below is a form of microdata. For other data types, extensions have been developed:
For this example, we assume the quasi-identifiers are {Age, Gender, Zip Code}. The Name column is a direct identifier and Diagnosis is a sensitive attribute. Consider the following original patient data (microdata):
Name | Age | Gender | Zip Code | Diagnosis |
---|---|---|---|---|
Alice | 28 | Female | 10021 | Hypertension |
Carol | 29 | Female | 10023 | Asthma |
Eve | 27 | Female | 10025 | Migraine |
Bob | 31 | Male | 10032 | Diabetes |
David | 32 | Male | 10034 | Hypertension |
To achieve 2-anonymity (k=2), we replace the direct identifier Name with a 5-digit random number pseudonym and generalize the Age and Zip Code quasi-identifiers.
2-Anonymous Data:
Pseudonym | Age Range | Gender | Zip Code | Diagnosis |
---|---|---|---|---|
83451 | 25-29 | Female | 1002x | Hypertension |
49281 | 25-29 | Female | 1002x | Asthma |
10358 | 25-29 | Female | 1002x | Migraine |
62215 | 30-34 | Male | 1003x | Diabetes |
90872 | 30-34 | Male | 1003x | Hypertension |
In this transformed table, the data is 2-anonymous. The records have been grouped into two equivalence classes based on the quasi-identifiers {Age Range, Gender, Zip Code}:
{25-29, Female, 1002x}, containing 3 records, and
{30-34, Male, 1003x}, containing 2 records.
Since the smallest equivalence class has 2 records, the dataset satisfies 2-anonymity. An attacker cannot distinguish any individual from at least one other person in the dataset, and the pseudonyms allow for internal record linkage without revealing real names.
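The claim can be verified programmatically; the sketch below groups the 2-anonymous records by the quasi-identifiers and reports the size of the smallest equivalence class, which is the k actually achieved.

```python
from collections import Counter

# The 2-anonymous records from the table above (quasi-identifiers plus diagnosis).
rows = [
    {"age": "25-29", "gender": "Female", "zip": "1002x", "diagnosis": "Hypertension"},
    {"age": "25-29", "gender": "Female", "zip": "1002x", "diagnosis": "Asthma"},
    {"age": "25-29", "gender": "Female", "zip": "1002x", "diagnosis": "Migraine"},
    {"age": "30-34", "gender": "Male", "zip": "1003x", "diagnosis": "Diabetes"},
    {"age": "30-34", "gender": "Male", "zip": "1003x", "diagnosis": "Hypertension"},
]
quasi_identifiers = ("age", "gender", "zip")

# Count the size of each equivalence class defined by the quasi-identifiers.
class_sizes = Counter(tuple(row[qi] for qi in quasi_identifiers) for row in rows)
k = min(class_sizes.values())

print(dict(class_sizes))                      # {('25-29', 'Female', '1002x'): 3, ('30-34', 'Male', '1003x'): 2}
print(f"The table satisfies {k}-anonymity")   # k = 2
```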
Differential Privacy (DP) offers a formal, mathematical guarantee of privacy that is distinct from traditional de-identification methods. The core problem it solves is protecting against attacks that try to infer information about a specific individual by observing the output of a data analysis. Even if a dataset is anonymized, an attacker could run a query on it, then run the same query on a slightly different dataset (e.g., with one person's data removed) and infer information about that person by comparing the results. Differential Privacy makes this type of attack mathematically impossible.
As outlined by the National Institute of Standards and Technology (NIST), DP works by injecting a carefully calibrated amount of random noise into the output of a computation or analysis (NIST_SP_800-226). This noise is large enough to mask the contribution of any single individual, but small enough to ensure that the overall result remains statistically useful. The key idea is that the output of a query should not significantly change whether any particular individual's data is included in the dataset or not. This provides plausible deniability for every participant.
The strength of the privacy guarantee is controlled by a parameter called epsilon (ε), often referred to as the "privacy budget." A smaller epsilon means more noise is added, providing stronger privacy but potentially lower accuracy. A larger epsilon provides higher accuracy but a weaker privacy guarantee. The choice of epsilon represents a direct trade-off between privacy and utility, and its value must be carefully chosen based on the sensitivity of the data and the needs of the analysis.
Unlike methods like k-anonymity, which modify the data itself, Differential Privacy is typically applied to the analysis or query mechanism. This makes it a powerful tool for sharing aggregate statistics, training machine learning models, or generating synthetic datasets that retain the statistical properties of the original data without revealing information about individuals. Its formal guarantees and flexibility have made it a leading approach for privacy-preserving data analysis in modern systems.
Imagine a hospital database with patient records. An attacker knows their target, Bob, is in this database and wants to find out if he has diabetes.
The Attack (on a Non-Private System): An attacker could perform a "differencing attack" if they can query the database at two different points in time or query two different versions of it. First, the attacker counts all patients with the target diagnosis:
SELECT COUNT(*) FROM Patients WHERE Diagnosis = 'Diabetes';
The system returns the true, exact count: 100. If the attacker can then run the same query against a version of the database that differs only by Bob's record (for example, a release from which Bob has been removed) and receives 99, the difference of one reveals that Bob has diabetes.
The Defense (with a Differentially Private System):
A system using Differential Privacy prevents this by adding calibrated random noise to the query result.
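As a minimal sketch of such a defense (standard library only; the epsilon value is illustrative), the Laplace mechanism adds noise drawn from a Laplace distribution with scale equal to the query's sensitivity divided by epsilon; a counting query has sensitivity 1, so with epsilon = 0.5 the returned count is perturbed enough that adding or removing any single patient's record cannot be reliably detected.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Answer a counting query with the Laplace mechanism (sensitivity 1 for counts)."""
    return true_count + laplace_noise(sensitivity / epsilon)

true_count = 100   # exact number of patients with Diagnosis = 'Diabetes'
epsilon = 0.5      # smaller epsilon -> more noise -> stronger privacy guarantee
print(round(dp_count(true_count, epsilon)))   # e.g. 97 or 103; any one patient's presence is masked
```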