De-Identification Profile
0.0.1-current - ci-build

De-Identification Profile, published by IHE IT Infrastructure Technical Committee. This guide is not an authorized publication; it is the continuous build for version 0.0.1-current built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/IHE/ITI.DeIdHandbook/ and changes regularly. See the Directory of published versions

Data Types

The semantic type of each data element determines the algorithm or algorithms to apply to that element. Below we discuss various types of data.

This table can be used as a starting point. There are also standard specifications available (e.g., DICOM PS3.15 Annex E, see Appendix B of this document) that take this high-level categorization and expand it to the individual attributes for particular kinds of data. Profile writers and others should extend these tables with any data types that are specific to their intended use.

Table: Data types


Data types Examples Approaches
Person identifying direct identifiers.

person's name (including preferred name, legal name, other names by which the person is known); by name, we are referring to the name and all name data elements as specified in ISO/TS 22220;

person identifiers (including, e.g., issuing authorities, types, and designations such as patient account number, medical record number, certificate/license numbers, social security number, health plan beneficiary numbers, vehicle identifiers and serial numbers, including license plate numbers);

biometrics (voiceprints, fingerprints, photographs, etc.);

digital certificates that identify an individual;

mother's maiden name and other similar relationship-based concepts (e.g., family links);

residential address;

electronic communications (telephone, mobile telephone, fax, pager, e-mail, URL, IP addresses, device identifiers, message control IDs, and device serial numbers);

subject of care linkages (mother, father, sibling, child);

descriptions of tattoos and identifying marks.

Should be removed where possible, or aggregated at a threshold specified by the domain or jurisdiction. Where these data need to be retained, risk assessment of unauthorized re-identification and appropriate mitigations to identified risks of the resulting data resource shall be conducted.
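
As an illustration of the "removed where possible" approach, the following is a minimal sketch that drops direct-identifier fields from a flat record. The field names and the record layout are hypothetical; a real profile would enumerate its own direct identifiers for its own data model.

```python
# Hypothetical field names, chosen for illustration only.
DIRECT_IDENTIFIER_FIELDS = frozenset({
    "name",
    "medical_record_number",
    "social_security_number",
    "residential_address",
    "telephone",
    "email",
})


def remove_direct_identifiers(record, fields=DIRECT_IDENTIFIER_FIELDS):
    """Return a copy of `record` without the listed direct-identifier fields."""
    return {key: value for key, value in record.items() if key not in fields}


record = {
    "name": "Jane Doe",
    "medical_record_number": "00123456",
    "diagnosis": "asthma",
}
cleaned = remove_direct_identifiers(record)
```

Removal of these fields does not by itself complete the risk assessment that the table calls for when any such data are retained.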
Aggregation variables

dates of birth and ages;

admission, discharge dates; and

location data

For statistical purposes, absolute data references should be avoided.

Dates of birth are highly identifying. Ages are less identifying but can still pose a threat for linking observational data; it is therefore better to use age groups or age types. Determining safe ranges requires a re-identification risk analysis, which is outside the scope of this Technical Specification.

Admission and discharge dates, etc. can also be aggregated into types of periods, or events can be expressed relative to a milestone (e.g., x months after treatment).

Location data should be aggregated if regional codes are too specific. Where location codes are structured in a hierarchical way, the finer levels can be stripped (e.g., where postal codes or dialing codes contain 20 000 or fewer people, the code may be changed to 0001)
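
The three aggregation approaches above can be sketched as follows. This is an illustrative sketch only: the band width, the top-coding age, the number of leading characters kept, the population table, and the placeholder value are all assumptions supplied by the caller, and safe values have to come from a re-identification risk analysis.

```python
from datetime import date


def age_group(age_years, width=10, top_code=90):
    """Map an exact age to a band such as '40-49'.

    Ages at or above `top_code` share a single open-ended band.
    `width` and `top_code` are illustrative defaults, not recommendations.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years >= top_code:
        return f"{top_code}+"
    lower = (age_years // width) * width
    upper = min(lower + width - 1, top_code - 1)
    return f"{lower}-{upper}"


def months_after(milestone, event):
    """Express an event as whole months elapsed since a milestone date."""
    months = (event.year - milestone.year) * 12 + (event.month - milestone.month)
    if event.day < milestone.day:
        months -= 1  # the current month is not yet complete
    return months


def generalize_location(code, population_by_prefix, placeholder,
                        keep=3, threshold=20_000):
    """Strip the finer levels of a hierarchical code.

    The leading `keep` characters are retained only when that area holds
    more than `threshold` people; otherwise the caller-supplied
    `placeholder` is returned. `population_by_prefix` is a hypothetical
    lookup table that the caller must provide.
    """
    prefix = code[:keep]
    if population_by_prefix.get(prefix, 0) > threshold:
        return prefix
    return placeholder


# Made-up population figures for illustration.
populations = {"902": 150_000, "036": 12_000}
banded_age = age_group(47)
offset = months_after(date(2020, 3, 15), date(2020, 9, 20))
kept = generalize_location("90210", populations, placeholder="XXX")
masked = generalize_location("03601", populations, placeholder="XXX")
```
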

Demographic data are indirect identifiers

language spoken at home;

person's communication language;

religion;

ethnicity;

person gender;

country of birth;

occupation;

criminal history;

person legal orders;

other addresses (e.g., business address, temporary addresses, mailing addresses);

birth plurality (second or later delivery from a multiple gestation).

Should be removed where possible, or aggregated at a threshold specified by the domain or jurisdiction. Where these data need to be retained, risk assessment of unauthorized re-identification and appropriate mitigations to identified risks of the resulting data resource shall be conducted.
Outlier variables

rare diagnoses;

uncommon procedures;

some occupations (e.g., tennis professional);

certain recessive traits uncharacteristic of the population in the information resource;

distinct deformities.

Outlier variables should be removed based upon risk assessment.
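
One simple input to such a risk assessment is a frequency count that flags values shared by very few records. The sketch below assumes a hypothetical minimum count of 5; the appropriate threshold depends on the domain, the jurisdiction, and the risk analysis itself.

```python
from collections import Counter


def rare_values(values, min_count=5):
    """Return the set of values that occur fewer than `min_count` times.

    `min_count` is an illustrative threshold, not a recommended value.
    """
    counts = Counter(values)
    return {value for value, n in counts.items() if n < min_count}


occupations = ["nurse"] * 6 + ["teacher"] * 5 + ["tennis professional"]
flagged = rare_values(occupations)
```

Flagged values are candidates for removal or generalization; a value that is common in one data resource can still be an outlier in another.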
Persistent data resources claiming pseudonymity

Shall be subject to routine risk analysis for potentially identifying outlier variables. This risk analysis shall be conducted at least annually. The identified risks shall be coupled with a risk mitigation strategy.
Structured data variables

vital signs;

diagnosis;

procedures; and

lab tests and results.

Structured data give some indication of what information can be expected and where it can be expected. It is then up to re-identification risk analysis to make assumptions about what can lead to (unacceptable) identification risks, ranging from simple rules of thumb up to analysis of populated databases and inference deductions. In “free text”, as opposed to “structured” data, automated analysis for privacy purposes with a guaranteed outcome is not possible.
Freeform text

Some examples are:

Physician notes

Referral letters

SOAP notes

Chief complaint

Nursing observations

Triage notes

Test interpretation

Susceptibility test interpretation

Impressions

Freeform text cannot be assured anonymity. All freeform text shall be subject to risk analysis and a mitigation strategy for identified risks. Re-identification risks of retained freeform text may be mitigated through:

implementation of policy surrounding freeform text content requiring that the freeform text data shall not contain directly identifiable information (e.g., patient numbers, names);

verification that freeform content is unlikely to contain identifying data (e.g., where freeform text is generated from structured text);

revising, rewriting or otherwise converting the data into coded form.

Computationally convert the freeform text into coded concepts, thus removing the need for the freeform text.

As parsing and natural language processing "data scrubbing" and pseudonymization algorithms progress, re-identification risks associated with freeform text may merit relaxation of this assertion.

Freeform text should be revised, rewritten or otherwise converted into coded form.
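
A policy that freeform text shall not contain directly identifiable information can be supported by a basic pattern scan. The sketch below is an assumption-laden illustration: the two regular expressions and the `known_names` parameter are hypothetical, and pattern matching of this kind cannot assure anonymity. It only surfaces text that needs review.

```python
import re

# Illustrative patterns only; real deployments need far broader coverage.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "long_number": re.compile(r"\b\d{6,}\b"),  # e.g., patient numbers
}


def flag_identifiers(text, known_names=()):
    """Return (label, matched_text) pairs for likely direct identifiers.

    `known_names` is a hypothetical caller-supplied list of names, for
    example taken from the structured part of the same record.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        findings.extend((label, m.group()) for m in pattern.finditer(text))
    for name in known_names:
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
            findings.append(("name", name))
    return findings


note = "Pt Jane Doe (MRN 00123456) reports wheeze; contact jane@example.org."
findings = flag_identifiers(note, known_names=["Jane Doe"])
```

An empty result does not show that a note is free of identifying data, so the risk analysis and mitigation strategy required above still apply.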

Text/voice data with non-parseable content

Voice recordings

As with freeform text, non-parseable data should be removed.
Image data

A radiology image with patient identifiers on the image.

Some medical data contain identifiable information within the data. Mitigations of such identifiable data in the structured and coded DICOM header should be in accordance with DICOM PS3.15 Annex E.

Additional risk assessment shall be considered for identifiable characteristics of the image or notations that are part of the image. See DICOM PS3.15 Annex E.