De-Identification Profile
0.0.1-current - ci-build

De-Identification Profile, published by IHE IT Infrastructure Technical Committee. This guide is not an authorized publication; it is the continuous build for version 0.0.1-current built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/IHE/ITI.DeIdHandbook/ and changes regularly. See the Directory of published versions

Data Types

De-identification involves removing or transforming personally identifiable information (PII) to protect privacy while preserving data utility for analysis, research, or sharing. This chapter organizes data types into two levels: Element Level (individual data elements with specific de-identification techniques) and Dataset Level (collections of elements in formats like FHIR or DICOM, requiring identification of risky elements, per-element mitigations, and dataset-wide risk quantification).

The element-level focus is on selecting techniques like removal, pseudonymization, or generalization for single attributes. At the dataset level, the process involves: (1) cataloging risky elements using standards like DICOM PS3.15 Annex E or HL7 FHIR; (2) applying element-specific mitigations; and (3) quantifying re-identification risk using statistical models (e.g., k-anonymity, l-diversity), which cannot be assessed at the element level alone. This framework draws from HIPAA, GDPR, DICOM PS3.15 Annex E, HL7 FHIR guidance, NIH genomic policies, and ISO/TS 22220 for person identification.
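As a rough illustration of step (3), the sketch below computes k for a small table: the size of the smallest group of records that share the same quasi-identifier values. The function name `k_anonymity`, the field names, and the sample records are hypothetical and chosen only for this example.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest equivalence class (the dataset's k)."""
    classes = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return min(classes.values()) if classes else 0

# Hypothetical, already generalized records (not real patient data).
records = [
    {"age_band": "30-34", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "30-34", "zip3": "021", "diagnosis": "diabetes"},
    {"age_band": "30-34", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "60-64", "zip3": "946", "diagnosis": "influenza"},
    {"age_band": "60-64", "zip3": "946", "diagnosis": "asthma"},
]

# Every combination of age band and ZIP prefix is shared by at least k records.
k = k_anonymity(records, ["age_band", "zip3"])
```

Because k depends on how many records share a combination of values, it is a property of the whole dataset and cannot be read off any single element.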

Level 1: Element Level

Data elements are categorized by re-identification risk, with techniques tailored to each category. The table below draws its examples and approaches from standards such as ISO/TS 22220 and DICOM PS3.15.
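Removal and pseudonymization, two of the element-level techniques named in this chapter, might be sketched as follows. The `Pseudonymizer` class, the field names, and the sample record are hypothetical; the in-memory lookup table is an assumption made for brevity, and a real deployment would need to store and protect that mapping.

```python
import uuid

class Pseudonymizer:
    """Map each direct identifier to a stable random UUID."""

    def __init__(self):
        # The lookup table is itself sensitive: whoever holds it can re-identify.
        self._table = {}

    def pseudonym(self, identifier):
        """Return the same UUID string every time the same identifier is seen."""
        if identifier not in self._table:
            self._table[identifier] = str(uuid.uuid4())
        return self._table[identifier]

pseudonymizer = Pseudonymizer()

# Hypothetical source record (not real patient data).
record = {"mrn": "MRN-0001", "name": "Example Patient", "diagnosis": "asthma"}

deidentified = {
    "subject_id": pseudonymizer.pseudonym(record["mrn"]),  # pseudonymize the ID
    "diagnosis": record["diagnosis"],                      # retain for utility
    # "name" is removed entirely by not copying it
}
```

Reusing one `Pseudonymizer` across records keeps a patient's rows linkable to each other without exposing the original identifier.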

Categories and Techniques

| Category | Description | Examples | De-Identification Techniques | Risk Focus |
|---|---|---|---|---|
| Direct Identifiers | Elements that uniquely identify an individual without additional data. High risk. | Names (full, legal, preferred, or other, per ISO/TS 22220); unique IDs (e.g., Social Security number, medical record number, patient account number, vehicle identifiers); biometrics (fingerprints, voice prints, photographs); contact details (email, phone, fax, pager, IP address, device serial numbers); digital certificates; mother’s maiden name; family linkages (e.g., mother, sibling); tattoos/identifying marks. | Remove entirely or pseudonymize (e.g., replace with a UUID). The HIPAA Safe Harbor method requires removal of 18 identifier types. | Prevent direct linkage; conduct expert determination if retention is needed; assess residual risks. |
| Quasi-Identifiers | Elements that, combined with others, enable re-identification via linkage attacks. Medium risk. | Dates (birth, admission, discharge); ages (especially >89); locations (ZIP codes, geocodes, regional codes with ≤20,000 people); demographics (gender, ethnicity, occupation, religion, language spoken at home); birth plurality (e.g., multiple gestation). | Generalize (e.g., age to 5-year bands, truncate ZIP codes to 3 digits, replace codes for small populations with a generic value such as 0001); shift dates relative to milestones (e.g., “x months after treatment”); use k-anonymity to choose safe generalization ranges. | Evaluate combinatory risks; suppress unique values to prevent linkage; run re-identification risk analysis. |
| Sensitive Attributes | Non-identifying but protected data, often context-dependent. Variable risk. | Diagnoses, procedures, lab results, criminal history, recessive traits, rare diagnoses, uncommon procedures, some occupations (e.g., tennis professional), distinct deformities. | Retain with masking or noise addition to preserve utility; remove outliers (e.g., rare diagnoses) based on risk assessment; apply l-diversity or t-closeness for distribution protection. | Balance privacy and utility; use differential privacy to limit inference; assess for uniqueness. |
| Free Text | Unstructured text with potential embedded identifiers. High risk due to unpredictability. | Physician notes, referral letters, SOAP notes, chief complaints, nursing observations, triage notes, test interpretations, impressions. | NLP scrubbing (e.g., using Presidio to remove PII); convert to coded forms; enforce policies that exclude identifiers (e.g., patient numbers); verify low risk if the text is derived from structured data. | Scan for hidden PII; implement mitigation strategies; consider computational conversion to coded concepts. |
| Multimedia | Visual or audio data with embedded identifiers. High risk. | Radiology images with annotations, voice recordings, videos. | Redact (e.g., blur faces, remove burned-in text per DICOM PS3.15 Annex E); clean pixel data; remove non-parseable content. Use tools like ARX for any accompanying tabular metadata. | Assess visual/auditory identifiers; ensure no PII remains embedded in headers or visuals; conduct additional risk assessments. |
| Geo-Location | Spatial data that can pinpoint individuals. Medium-high risk. | GPS coordinates, residential addresses, regional codes, postal codes. | Aggregate (e.g., to city or regional level); add noise; strip fine-grained levels (e.g., change postal codes with <20,000 people to 0001). | Prevent triangulation with other data; conduct geospatial risk analysis to assess linkage potential. |
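The generalization techniques listed above for quasi-identifiers (age bands, ZIP truncation, and dates expressed relative to a milestone) can be sketched as below. All function and parameter names are hypothetical. The set of small-population prefixes is assumed to come from census data that this sketch does not include, and the `0001` placeholder simply follows the example in the table.

```python
from datetime import date

def age_band(age, width=5, top_code=90):
    """Generalize an age to a band; ages of top_code or more share one band."""
    if age >= top_code:
        return f"{top_code}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def truncate_zip(zip_code, small_prefixes=frozenset(), placeholder="0001"):
    """Keep the first three ZIP digits; mask prefixes flagged as small populations."""
    prefix = zip_code[:3]
    return placeholder if prefix in small_prefixes else prefix

def months_after(milestone, event):
    """Express an event date as whole months elapsed since a milestone date."""
    months = (event.year - milestone.year) * 12 + (event.month - milestone.month)
    if event.day < milestone.day:
        months -= 1  # the current month is not yet complete
    return months

# Hypothetical values for illustration only.
band = age_band(37)
zip3 = truncate_zip("02139")
masked = truncate_zip("03601", small_prefixes={"036"})  # "036" is an assumed entry
offset = months_after(date(2023, 1, 15), date(2023, 4, 20))
```

Wider bands and shorter prefixes lower re-identification risk at the cost of analytic detail, so the widths are best chosen together with a dataset-level risk measure rather than fixed in advance.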

Level 2: Dataset Level

Datasets are collections of elements organized in specific formats (e.g., FHIR, DICOM). Risk analysis involves: (1) Identifying risky elements using standard-specific lists; (2) Applying per-element mitigations from the element-level categories; (3) Quantifying dataset-wide re-identification risk using statistical models (e.g., k-anonymity, l-diversity, t-closeness) to address linkage or inference risks not visible at the element level. This section incorporates DICOM-specific tag guidance and generalizes approaches for other dataset types.
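To show why step (3) looks beyond k-anonymity, the sketch below checks distinct l-diversity, the simplest variant: the smallest number of different sensitive values found in any group of records sharing the same quasi-identifiers. The function name, field names, and sample records are hypothetical.

```python
from collections import defaultdict

def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the fewest distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for record in records:
        key = tuple(record[qi] for qi in quasi_identifiers)
        classes[key].add(record[sensitive])
    return min((len(values) for values in classes.values()), default=0)

# Hypothetical records: each quasi-identifier combination appears twice (k = 2).
records = [
    {"age_band": "30-34", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "30-34", "zip3": "021", "diagnosis": "diabetes"},
    {"age_band": "60-64", "zip3": "946", "diagnosis": "influenza"},
    {"age_band": "60-64", "zip3": "946", "diagnosis": "influenza"},
]

# The second group holds a single diagnosis, so membership alone reveals it.
l_value = distinct_l_diversity(records, ["age_band", "zip3"], "diagnosis")
```

Here the table is 2-anonymous yet only 1-diverse, which is the kind of inference risk that per-element mitigation alone does not expose.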

Dataset Types and Approaches

| Dataset Type | Description & Organization | Risky Elements Identification | Per-Element Mitigation | Dataset-Wide Risk Quantification |
|---|---|---|---|---|
| FHIR (Fast Healthcare Interoperability Resources) | Standardized format for healthcare data exchange, organized as resources (e.g., Patient, Encounter) with structured elements (e.g., identifiers, narratives). | Patient.identifier (e.g., medical record numbers), Encounter.period dates, narrative free text, patient demographics (gender, ethnicity, language). HL7 provides PII lists for resources. | Remove direct identifiers (e.g., Patient.name); generalize dates (e.g., to year); scrub free text via NLP or policies; pseudonymize IDs for traceability. | Apply k-anonymity across resources to ensure no unique combinations; use differential privacy for aggregated queries; conduct periodic (e.g., annual) risk assessments consistent with GDPR and HL7 guidance. |
| DICOM (Digital Imaging and Communications in Medicine) | Standard for medical imaging, organized by tags/attributes (e.g., headers, pixel data). | Per DICOM PS3.15 Annex E: Patient’s Name (0010,0010), Patient ID (0010,0020), Birth Date (0010,0030), Institution Name (0008,0080), Study Description (0008,1030; free text), Pixel Data (7FE0,0010; visuals), Mother’s Maiden Name, Tattoos/Identifying Marks. | Replace names/IDs with dummy (D) or zero-length (Z) values; remove (X) dates and locations; clean (C) free text; replace UIDs consistently (U). Use options such as Clean Pixel Data for visuals or Retain Longitudinal Temporal Information with Modified Dates for date-shifted data. | Measure re-identification risk via tag inference; apply profiles (e.g., Basic Profile plus Retain Longitudinal Temporal Information) with statistical validation (e.g., k-anonymity); conduct routine risk analysis. |
| Genomics | Sequence data (e.g., DNA variants) in formats like VCF or BAM, with metadata (e.g., sample IDs). Highly unique due to genetic markers. | Sample identifiers, rare variants (e.g., SNPs), metadata (dates, locations), family linkages (e.g., mother, sibling). NIH requires HIPAA-compliant de-identification. | Suppress rare variants; pseudonymize IDs; aggregate to summary statistics; scrub metadata using computational tools. | Quantify via beacon attacks or kinship inference; apply differential privacy; assess unique cybersecurity risks per NIST guidelines. |
| Structured/Tabular (e.g., CSV, Databases) | Row-column format with fields like IDs and demographics. Organized as tables. | Columns with direct/quasi-identifiers (e.g., name, age, ZIP, tattoos); outlier rows (e.g., rare diagnoses, distinct deformities). Tools like sdcMicro identify risks. | Generalize columns (e.g., age bands, truncated ZIP codes); suppress unique rows/outliers; use tools like ARX for automation. | Apply k-anonymity/l-diversity across the table; quantify linkage risks; conduct annual risk analysis for persistent datasets. |
| Transactional (e.g., Logs, Databases) | Event-based records (e.g., purchases, visits) with timestamps and IDs. Organized chronologically or by entity. | Transaction IDs, timestamps (quasi-identifiers), user details (e.g., contact info), behavioral patterns, family linkages. | Pseudonymize IDs; shift timestamps; suppress sensitive transactions or identifiers (e.g., mother’s maiden name). | Perform trajectory analysis for re-identification risk; apply differential privacy on aggregates; assess pattern-based risks. |
| Time Series | Sequential data (e.g., sensor readings, vital signs) with timestamps. Organized by time intervals. | Timestamps, unique patterns (e.g., waveforms), linked identifiers (e.g., device serial numbers). | Add noise to values; generalize time intervals; randomize values while preserving trends; remove linked identifiers. | Measure pattern-based re-identification (e.g., via machine learning attacks); use t-closeness for distribution protection; conduct routine risk analysis. |
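As an illustration of the per-element mitigations listed for FHIR above, the sketch below processes a Patient resource held as a plain Python dict. The function names, the list of removed fields, the sample resource, and the use of HMAC-SHA256 for pseudonyms are assumptions made for this example; none of them is prescribed by FHIR or by this profile.

```python
import copy
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key; store it securely

def pseudonymize(value, key=SECRET_KEY):
    """Derive a stable pseudonym with HMAC-SHA256 (an assumed choice)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def deidentify_patient(patient):
    """Return a de-identified copy of a FHIR Patient resource given as a dict."""
    result = copy.deepcopy(patient)
    # Remove direct identifiers and the narrative, which may embed free-text PII.
    for field in ("name", "telecom", "address", "photo", "text"):
        result.pop(field, None)
    # Pseudonymize identifier values so records stay linkable.
    for identifier in result.get("identifier", []):
        if "value" in identifier:
            identifier["value"] = pseudonymize(identifier["value"])
    # Generalize the birth date to the year; FHIR dates may be year-only.
    if "birthDate" in result:
        result["birthDate"] = result["birthDate"][:4]
    return result

# Hypothetical Patient resource (not real patient data).
patient = {
    "resourceType": "Patient",
    "text": {"status": "generated", "div": "<div>Pat Example, born 1980-04-12</div>"},
    "identifier": [{"system": "urn:example:mrn", "value": "MRN-0001"}],
    "name": [{"family": "Example", "given": ["Pat"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}

deidentified = deidentify_patient(patient)
```

This sketch drops the address outright rather than generalizing it, and it leaves `identifier.system` and `gender` untouched; whether those choices are acceptable depends on the dataset-wide risk assessment.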

Conclusion

The element-level framework ensures targeted protection for individual data elements, enhanced by specific examples (e.g., tattoos, family linkages) and mitigation strategies (e.g., converting free text to coded forms) from standards like ISO/TS 22220 and DICOM PS3.15. The dataset-level approach provides comprehensive risk management by identifying risky elements, applying per-element mitigations, and quantifying dataset-wide risks. Regular audits and domain-specific extensions (e.g., adding new dataset types like IoT or EHR formats) are recommended to maintain robust privacy protection.