De-Identification Profile
0.0.1-current - ci-build
De-Identification Profile, published by IHE IT Infrastructure Technical Committee. This guide is not an authorized publication; it is the continuous build for version 0.0.1-current built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/IHE/ITI.DeIdHandbook/ and changes regularly. See the Directory of published versions
The semantic type of each data element determines the algorithm or algorithms to apply to that element. Below we discuss various types of data.
The table below can be used as a starting point. Standard specifications (e.g., DICOM PS3.15 Annex E; see Appendix B of this document) take this high-level categorization and expand it to the individual attributes for particular kinds of data. Profile writers and others should extend the table with any data types that are specific to their intended use.
Table: Data types
| Data types | Examples | Approaches |
|---|---|---|
| Person-identifying direct identifiers | person's name (including preferred name, legal name, and other names by which the person is known), where "name" refers to the name and all name data elements as specified in ISO/TS 22220; person identifiers (including, for example, issuing authorities, types, and designations such as patient account number, medical record number, certificate/license numbers, social security number, health plan beneficiary numbers, vehicle identifiers and serial numbers, including license plate numbers); biometrics (voice prints, fingerprints, photographs, etc.); digital certificates that identify an individual; mother's maiden name and other similar relationship-based concepts (e.g., family links); residential address; electronic communications (telephone, mobile telephone, fax, pager, e-mail, URL, IP addresses, device identifiers, message control IDs, and device serial numbers); subject of care linkages (mother, father, sibling, child); descriptions of tattoos and identifying marks. | Should be removed where possible, or aggregated at a threshold specified by the domain or jurisdiction. Where these data need to be retained, a risk assessment of unauthorized re-identification shall be conducted, with appropriate mitigations for the identified risks of the resulting data resource. |
| Aggregation variables | dates of birth and ages; admission and discharge dates; location data. | For statistical purposes, absolute data references should be avoided. Dates of birth are highly identifying. Ages are less identifying but can still pose a threat for linking observational data; it is therefore better to use age groups or age types. Determining safe ranges requires a re-identification risk analysis, which is outside the scope of this Technical Specification. Admission dates, discharge dates, etc. can also be aggregated into types of periods, or events can be expressed relative to a milestone (e.g., x months after treatment). Location data should be aggregated if regional codes are too specific. Where location codes are structured hierarchically, the finer levels can be stripped; e.g., where postal codes or dialing codes contain 20 000 or fewer people, the code may be changed to 0001. |
| Demographic data (indirect identifiers) | language spoken at home; person's communication language; religion; ethnicity; person gender; country of birth; occupation; criminal history; person legal orders; other addresses (e.g., business address, temporary addresses, mailing addresses); birth plurality (second or later delivery from a multiple gestation). | Should be removed where possible, or aggregated at a threshold specified by the domain or jurisdiction. Where these data need to be retained, a risk assessment of unauthorized re-identification shall be conducted, with appropriate mitigations for the identified risks of the resulting data resource. |
| Outlier variables | rare diagnoses; uncommon procedures; some occupations (e.g., tennis professional); certain recessive traits uncharacteristic of the population in the information resource; distinct deformities. | Outlier variables should be removed based upon risk assessment. |
| Persistent data resources claiming pseudonymity | | Shall be subject to routine risk analysis for potentially identifying outlier variables. This risk analysis shall be conducted at least annually. The identified risks shall be coupled with a risk mitigation strategy. |
| Structured data variables | vital signs; diagnosis; procedures; lab tests and results. | Structured data give some indication of what information can be expected and where it can be expected. It is then up to re-identification risk analysis to make assumptions about what can lead to (unacceptable) identification risks, ranging from simple rules of thumb up to analysis of populated databases and inference deductions. In “free text”, as opposed to “structured” data, automated analysis for privacy purposes with a guaranteed outcome is not possible. |
| Freeform text | physician notes; referral letters; SOAP notes; chief complaint; nursing observations; triage notes; test interpretation; susceptibility test interpretation; impressions. | Freeform text cannot be assured anonymity. All freeform text shall be subject to risk analysis and a mitigation strategy for identified risks. Re-identification risks of retained freeform text may be mitigated through: implementation of policy requiring that freeform text data shall not contain directly identifiable information (e.g., patient numbers, names); verification that freeform content is unlikely to contain identifying data (e.g., where freeform text is generated from structured text); revising, rewriting or otherwise converting the data into coded form; computationally converting the freeform text into coded concepts, thus removing the need for the freeform text. As parsing and natural language processing "data scrubbing" and pseudonymization algorithms progress, re-identification risks associated with freeform text may merit relaxation of this assertion. Freeform text should be revised, rewritten or otherwise converted into coded form. |
| Text/voice data with non-parsable content | voice recordings. | As with freeform text, non-parsable data should be removed. |
| Image data | a radiology image with patient identifiers on the image. | Some medical data contain identifiable information within the data. Mitigations of such identifiable data in the structured and coded DICOM header should be in accordance with DICOM PS3.15 Annex E. Additional risk assessment shall be considered for identifiable characteristics of the image or notations that are part of the image. See DICOM PS3.15 Annex E. |
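
To illustrate the approaches listed for aggregation variables in the table above, the following is a minimal Python sketch and not a normative algorithm. The function names, the 10-year age band, the `population_by_prefix` lookup, and the use of "0001" as a suppression value (taken as quoted in the table) are all assumptions made for this example. Safe ranges and thresholds should come from a re-identification risk analysis for the specific project.

```python
from datetime import date

# Assumed values for illustration only; real thresholds come from risk analysis.
AGE_BAND_YEARS = 10            # hypothetical age-group width
MIN_AREA_POPULATION = 20_000   # population threshold quoted in the table
SUPPRESSED_CODE = "0001"       # replacement value as quoted in the table


def age_group(age: int, width: int = AGE_BAND_YEARS) -> str:
    """Replace an exact age with an age band such as '30-39'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def months_after(milestone: date, event: date) -> int:
    """Express an event relative to a milestone, in whole months (rounded down)."""
    months = (event.year - milestone.year) * 12 + (event.month - milestone.month)
    if event.day < milestone.day:
        months -= 1  # the last month has not fully elapsed yet
    return months


def generalize_postal_code(
    code: str,
    population_by_prefix: dict[str, int],
    threshold: int = MIN_AREA_POPULATION,
    suppressed: str = SUPPRESSED_CODE,
) -> str:
    """Strip finer levels of a hierarchical code until the area is large enough.

    `population_by_prefix` is a hypothetical lookup from code prefix to the
    number of people living in that area. If no prefix exceeds the threshold,
    the code is replaced with the suppression value.
    """
    for length in range(len(code), 0, -1):
        prefix = code[:length]
        if population_by_prefix.get(prefix, 0) > threshold:
            return prefix
    return suppressed


if __name__ == "__main__":
    populations = {"90210": 1_500, "902": 45_000}
    print(age_group(37))
    print(months_after(date(2020, 1, 15), date(2020, 4, 10)))
    print(generalize_postal_code("90210", populations))
```

With the sample lookup, the full code "90210" covers too few people, so the sketch falls back to the coarser prefix "902". A code with no sufficiently populated prefix is replaced by the suppression value.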