Family Planning Example

De-Identification Profile, published by IHE IT Infrastructure Technical Committee. This guide is not an authorized publication; it is the continuous build for version 0.0.1-current built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/IHE/ITI.DeIdHandbook/ and changes regularly. See the Directory of published versions

Purpose and Overview
Intended Audience
Context Analysis
Data Assessment
De-identification Goals
Risk Assessment
Risk Mitigation Design
Implementation and Validation
Governance
Appendix A: Sample FP CDA documents and their De-Identified documents
Appendix B: Usability

This section, the IHE IT Infrastructure (ITI) Analysis of Optimal De-Identification for Family Planning Data Elements, demonstrates the application of the IHE De-Identification Handbook's systematic process framework. It describes the comprehensive de-identification analysis performed by the ITI Technical Committee for the Family Planning Annual Report (FPAR) use case published in the IHE Quality, Research, and Public Health (QRPH) Family Planning version 2 (FPv2) Trial Implementation Supplement, Rev. 1.4 (December 29, 2021).

Purpose and Overview

This document serves as both a practical demonstration of the de-identification process framework and a complete de-identification profile for Family Planning data in the Title X FPAR use case. As described in the IHE Profile Development Guidance, a de-identification profile specifies the source data type and defines how each data element will be removed, modified, or preserved during de-identification.

This de-identification profile demonstrates:

Context Analysis: Clearly defining the data collection purpose, recipients, and data flow within the Title X family planning service network
Data Assessment: Comprehensive evaluation of data content, including attribute classification using the updated Data Types framework (Direct Identifiers, Quasi-Identifiers, Sensitive Attributes)
Goal Determination: Establishing specific de-identification objectives that balance clinical and privacy perspectives
Risk Assessment: Both qualitative and quantitative evaluation of re-identification risks
Risk Mitigation Design: Multi-stage de-identification architecture with element-by-element technique selection using techniques from the Algorithms chapter
Implementation and Validation: Process validation and governance framework
Identifiability Transitions: Demonstrating the transformation from Identified Data through Reversible-Pseudonymized Data to achieve the target identifiability level as defined in Concepts

As a de-identification profile, this document fulfills the requirements outlined in IHE Profile Development Guidance by:

Identifying data requirements for the Title X FPAR reporting class of use cases
Identifying risks that can be removed through de-identification techniques
Specifying residual risks requiring additional protections (controlled access, data use agreements)
Defining element-by-element treatment for all data elements in Family Planning CDA documents

The detailed design questions that guide technique selection are organized by attribute type in the Project and Data Details subsection under De-identification Goals.

Intended Audience

This implementation guide serves three primary audiences:

Software Developers and Implementers: Those who will implement the de-identification techniques into their software systems. This de-identification profile serves as the specification for what transformations to apply to each data element. Developers should use the IHE QRPH Family Planning version 2 (FPv2) supplement for source document structure, this profile for de-identification specifications, and the Algorithms chapter for implementation techniques.
Privacy and Security Professionals: Those responsible for designing, validating, and governing de-identification processes. This document demonstrates the application of the systematic Process framework, including context analysis, risk assessment, and mitigation design. It serves as a template for other de-identification projects.
Clinicians, Researchers, and Data Analysts: Those who seek to understand how and why specific de-identification techniques were selected for each data element, and how the resulting dataset maintains utility for public health reporting while protecting patient privacy. This audience will benefit from understanding the Data Types classification and the Concepts of identifiability levels.

Context Analysis

This section analyzes the environment in which Family Planning data is collected, processed, and shared, following the framework established in the Process chapter.

Purpose of Data Collection

The intended use of the de-identified data determines the extent of de-identification and acceptable risk levels. The primary purpose for collecting Family Planning data is:

Primary Purpose: To support the U.S. Office of Population Affairs (OPA) Title X Family Planning Annual Reports (FPAR) for:

Program planning and budgeting at federal and grantee levels
Basic monitoring of program performance and adherence to funded project scope
Clinical quality improvement initiatives
Ensuring clients receive access to a broad range of family planning services and methods
Verifying services are delivered to intended populations
Budget justification to Congress for family planning service funding
Allocation of funding to address unmet needs in specific geographic areas

Scope Constraints: The de-identified dataset is specifically designed for public health reporting and program evaluation. It is not intended to support:

General-purpose research requiring high-fidelity individual-level data
Clinical treatment or patient care (data has already been used for that purpose)
Re-identification for adverse event notification (not applicable to this non-interventional use case)

Data Recipients

The de-identified FPAR data will be accessed by the following categories of recipients:

Organizational Recipients:

OPA headquarters staff (federal employees)
Title X grantees (approximately 89 organizations)
Sub-recipient agencies under Title X grantees
Service sites within the Title X network (approximately 4,100 sites)

Recipient Profiles:

Public health professionals and program managers responsible for Title X program oversight
Data analysts conducting performance measurement and quality improvement
Grant administrators monitoring compliance and resource allocation

Relationship to Data Custodian: The data flows from service sites to sub-recipients to grantees to OPA, creating a hierarchical reporting structure. All recipients operate under Title X grant agreements with confidentiality obligations.

Background Knowledge Assessment: Recipients have varying levels of background knowledge:

High knowledge: Service site staff may know individual patients in their local community
Medium knowledge: Grantee-level staff may have regional demographic knowledge
Lower knowledge: OPA headquarters staff have limited local community knowledge
External risk: With potential access numbering in the thousands, the risk posture requires treating this as a controlled-public sharing model rather than fully internal use

Data Flow

The de-identification architecture for Family Planning data follows a centralized model to ensure consistency and apply specialized expertise:

Original Data Source: Family Planning CDA documents generated at approximately 4,100 service sites that provide family planning and related preventive health services

De-identification Architecture: Based on public comment feedback, a single, centralized de-identification intermediary was selected over multiple distributed de-identification points. This decision was driven by:

Need for consistent pseudonymization across all sources
Requirement for specialized expertise in privacy-preserving transformations
Maintaining accuracy and longitudinal consistency
Centralized application of risk assessment methodologies
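One way to picture the consistency requirement is a single mapping table held by the central intermediary. The sketch below is illustrative only: the class name MappingTable, the (site, local ID) lookup key, and the 16-character hexadecimal token format are assumptions and are not specified by this profile.

```python
import secrets


class MappingTable:
    """Hypothetical mapping table: one random pseudonym per source identifier."""

    def __init__(self) -> None:
        self._forward: dict[tuple[str, str], str] = {}
        self._issued: set[str] = set()

    def pseudonym(self, site_id: str, local_patient_id: str) -> str:
        """Return the same pseudonym every time the same source ID is seen."""
        key = (site_id, local_patient_id)
        if key not in self._forward:
            token = secrets.token_hex(8)  # 16 hex characters, no embedded meaning
            while token in self._issued:  # re-draw on the (unlikely) collision
                token = secrets.token_hex(8)
            self._issued.add(token)
            self._forward[key] = token
        return self._forward[key]


table = MappingTable()
first_visit = table.pseudonym("site-0001", "MRN-123")
second_visit = table.pseudonym("site-0001", "MRN-123")
other_patient = table.pseudonym("site-0002", "MRN-123")
```

Because only one table exists, repeat visits resolve to the same pseudonym, which is what preserves longitudinal consistency. If each site kept its own table, the same guarantee would hold only within that site. How the table is protected or discarded determines whether the result can be treated as irreversible; this sketch does not address that.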

Data Transformation Process: The de-identification process follows format-specific best practices by maintaining de-identified data in native clinical document formats (CDA or FHIR) throughout the de-identification stages, with CSV serving as a derivative format for FPAR reporting:

Input: Family Planning CDA documents from 4,100 service sites (Identified Data per Concepts)
Stage 1 - Preliminary De-identification:
- Performed at source or early in pipeline
- Applies basic transformations to remove direct identifiers
- Uses reversible pseudonymization
- Output: De-identified CDA/FHIR documents with format-specific indicators (e.g., CDA nullFlavor="MSK" for masked elements, FHIR DataAbsentReason extension for removed data)
Stage 2 - Advanced De-identification:
- Performed by centralized intermediary
- Input: Stage 1 de-identified CDA/FHIR documents
- Applies:
  - Irreversible pseudonymization for patient identifiers
  - Generalization for quasi-identifiers (dates, geographic data)
  - Suppression for unnecessary direct identifiers
  - Risk assessment and validation
- Output: Fully de-identified CDA/FHIR documents with all transformations documented using format-specific metadata
CSV Derivative Generation:
- Extract relevant data elements from Stage 2 de-identified CDA/FHIR documents
- Transform to CSV format where each row represents a de-identified Family Planning visit
- CSV does not preserve format-specific de-identification metadata; provenance is maintained in source documents and audit records
Distribution: De-identified CSV files to OPA and authorized Title X network recipients; de-identified CDA/FHIR documents retained for audit and compliance
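The Stage 1 masking indicator can be sketched with the Python standard library. This is a minimal illustration, assuming the patient name sits at the usual CDA location (recordTarget/patientRole/patient/name); the sample document, the OID 1.2.3.4, and the function name mask_patient_names are hypothetical.

```python
import xml.etree.ElementTree as ET

CDA_NS = "urn:hl7-org:v3"
NS = {"cda": CDA_NS}
ET.register_namespace("", CDA_NS)  # keep the default namespace on output

# Made-up sample document; real FP CDA documents carry far more content.
SAMPLE = f"""<ClinicalDocument xmlns="{CDA_NS}">
  <recordTarget><patientRole>
    <id root="1.2.3.4" extension="MRN-123"/>
    <patient><name><given>Jane</given><family>Doe</family></name></patient>
  </patientRole></recordTarget>
</ClinicalDocument>"""


def mask_patient_names(root: ET.Element) -> int:
    """Empty each patient name element and flag it with nullFlavor="MSK"."""
    masked = 0
    for name in root.findall(".//cda:patient/cda:name", NS):
        name.clear()                    # drops child elements, text, attributes
        name.set("nullFlavor", "MSK")   # marks the value as masked
        masked += 1
    return masked


root = ET.fromstring(SAMPLE)
masked = mask_patient_names(root)
output = ET.tostring(root, encoding="unicode")
```

Keeping the emptied element in place, rather than deleting it, leaves the document structure intact and records that a value existed but was withheld.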

Figure: De-Identification for Family Planning Process Diagram

Data Flow Diagram: The complete data flow in the Title X FPAR use case is illustrated in the figure below. The diagram reflects a multi-stage de-identification process, where Family Planning CDA documents move from service sites through preliminary de-identification (Stage 1), then through a centralized advanced de-identification intermediary (Stage 2), before reaching final recipients. Each stage applies distinct privacy-preserving transformations, as described in the text above.

Figure: Multi-Stage De-Identification Data Flows in the Title X Family Planning Annual Report Use Case
This diagram illustrates the two-stage de-identification architecture: Stage 1 (Preliminary De-identification) occurs at the source or early in the pipeline, and Stage 2 (Advanced De-identification) is performed by a centralized intermediary. Data flows and process points in the diagram correspond to these stages, showing how privacy risk is progressively reduced before data reaches authorized recipients.

Important Note: External submissions use Family Planning CDA documents only. Internal processing may maintain de-identified CDA or FHIR representations. Other document types submitted via similar workflows are out of scope for this analysis.

Regulatory and Policy Context

Applicable Regulations:

HIPAA Privacy Rule (45 CFR Part 160 and Subparts A and E of Part 164)
Title X of the Public Health Service Act
Federal grant requirements for OPA Title X program

Data Sensitivity: Family Planning data involves sensitive reproductive health information requiring enhanced privacy protection for a potentially vulnerable patient population.

Data Assessment

This section provides a comprehensive evaluation of the Family Planning data following the assessment framework in the Process chapter.

Data Subject Characteristics

Population: Individuals receiving family planning services at Title X-funded clinics

Key Characteristics:

Age Range: Predominantly reproductive age (15-44 years), including adolescents and older individuals
Geographic Distribution: Nationwide across U.S. states and territories, concentrated in underserved areas
Gender: Predominantly female patients, with male patients also served
Socioeconomic Status: Primarily low-income (Title X targets ≤250% federal poverty level)
Vulnerable Population Status: Enhanced re-identification risks due to potential social stigma, concentrated service delivery, and overlapping demographic characteristics

Data Type and Content

Structured Data (Primary format): Family Planning CDA documents containing:

Demographics: age, gender, race, ethnicity, language
Socioeconomic: income, health insurance status
Clinical: pregnancy history, services provided
Identifiers: patient, provider, facility IDs
Temporal: dates of birth, visit dates
Geographic: ZIP codes, facility locations

Free Text: Limited presence; requires NLP-based identifier scrubbing

Format: CDA/XML requiring XML parsing capabilities for de-identification processing

Data Attribute Classification

Following the Data Types framework:

Direct Identifiers: Patient Name (excluded from FPAR), Facility Name (transformed to ID), Provider Name (transformed to ID)

Quasi-Identifiers: Date of Birth/Age, Visit Date, Sex, Race, Ethnicity, ZIP Code, Language, Pregnancy History, Income, Insurance Status

Sensitive Attributes: Services Received, Clinical Procedures, Pregnancy Outcomes, STI Test Results

Dataset Properties

Currency: Operational data; annual FPAR compilation
Scale: ~3.9 million patients annually from 4,100 service sites
Quality: High-quality structured data with standard terminologies (SNOMED CT, LOINC, ICD, CDC PHIN VADS)
Longitudinal Characteristic: Multiple visits per patient increase re-identification risk

Attack Modeling

Sharing Model: Controlled Public Sharing (Data Use Agreement)

Accessible to thousands of authorized Title X participants
Federal grant agreements prohibit re-identification
Risk level: Moderate-to-high protection required

Attack Types: Identity attacks (journalist risk), attribute attacks on small groups, membership attacks

Privacy Model: k-anonymity (provides measurable, interpretable guarantees for quasi-identifier linkage attacks)

Terminology Reference

This analysis uses core terminology defined in the Concepts chapter, including de-identification, pseudonymization, anonymization, and identifiability levels. For complete definitions, refer to that chapter.

Domain-Specific Terms:

PHIN VADS: Public Health Information Network Vocabulary Access and Distribution System from CDC (https://phinvads.cdc.gov/vads/SearchVocab.action) - standard code sets for demographics
FPAR: Family Planning Annual Report required for Title X program
Title X: Federal grant program (Public Health Service Act) for family planning services

De-identification Goals

This section establishes specific objectives for the FPAR de-identification process per the Process framework, balancing privacy protection with data utility.

General Goals

Prevent Identification: Remove/transform direct identifiers; reduce combinability of quasi-identifiers
Control Risk: Achieve acceptable residual risk for controlled-public sharing
Preserve Utility: Maintain fidelity for FPAR reporting, performance measurement, trend analysis, and geographic analysis

Specific FPAR Goals

Scope: U.S. Title X context; reporting purpose only (NOT general research); protects vulnerable reproductive health patients

Entity De-identification: Patients (primary), facilities and providers (secondary, to prevent indirect identification)

Data Fidelity: Maintain aggregate accuracy, preserve longitudinal relationships, support geographic analysis

Identifiability Target: Irreversibly Pseudonymized Data (Concepts) - stronger than reversible pseudonymization, practical for longitudinal utility

Risk Threshold: Average re-identification risk ≤ 0.05 (5%); k-anonymity with k ≥ 20 as working target

Rationale: Non-public controlled sharing + medium-high attack possibility + medium-high impact (sensitive data, vulnerable population) = conservative threshold needed
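The two thresholds can be checked against a candidate dataset along the following lines. This is a sketch under stated assumptions: average risk is computed here as the mean of 1/(equivalence class size) over all records, which is one common reading of the term but is not defined by this profile, and the field names and sample records are invented.

```python
from collections import Counter

K_TARGET = 20        # working k-anonymity target from this profile
MAX_AVG_RISK = 0.05  # average re-identification risk threshold


def assess(records: list[dict], quasi_identifiers: list[str]) -> dict:
    """Group records by quasi-identifier values and test both thresholds."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    smallest = min(classes.values())
    # Mean over records of 1/class size equals (number of classes) / (number of records).
    avg_risk = len(classes) / len(records)
    return {
        "k": smallest,
        "average_risk": avg_risk,
        "meets_k": smallest >= K_TARGET,
        "meets_average_risk": avg_risk <= MAX_AVG_RISK,
    }


qi = ["age", "sex"]
records = [{"age": "25", "sex": "F"}] * 40 + [{"age": "31", "sex": "F"}] * 20
balanced = assess(records, qi)

# One unique record breaks k-anonymity even though the average stays low.
with_outlier = assess(records + [{"age": "over 50", "sex": "M"}], qi)
```

The second call shows why both measures are tracked: a single unique record drops k to 1 while the average risk remains under 0.05, so the average alone would not reveal the exposed individual.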

Project and Data Details (attribute-specific design questions)

To select techniques per element, the following questions are applied by attribute type:

Direct Identifiers (DI)

Can it be deleted entirely?
If retained, can it be pseudonymized (reversible or irreversible)?

Categorical Quasi-Identifiers (QI) (e.g., race, ethnicity, language, insurance)

Can the value be generalized or grouped (e.g., broader categories)?
Can rare categories be suppressed or recoded?
Is code substitution acceptable (meaningful vs. meaningless)?

Numerical Quasi-Identifiers (QI) (e.g., income, counts)

Can the value be truncated, rounded, top/bottom coded, or statistically adjusted?
Can outliers be replaced with floor/ceiling values?

Temporal Quasi-Identifiers (e.g., DOB, visit date, time of day)

Can dates be collapsed (e.g., to month/year) or shifted by random offset?
Can time of day be generalized (e.g., weekend/weekday)?
Must longitudinal consistency be maintained after transformation?

Geographic Quasi-Identifiers (e.g., ZIP, facility location)

Can geography be generalized (e.g., 3-digit ZIP, county/region)?
Must longitudinal/geographic consistency be maintained across records?

Free Text / Sensitive Attributes

Can free text be removed or NLP-scrubbed? Can attributes be made fuzzier (noise/random code set)?
Can values be substituted with pseudorandom or sequential recoverable values when utility requires?

Iterative Review

After initial selections, perform holistic review to ensure transformations reduce risk without over-suppressing utility; document usability and threat analyses.

Balancing Clinical and Privacy Perspectives

Clinical/Public Health: Maintain FPAR performance measures, analytical utility for program evaluation

Privacy/Security: Protect vulnerable patients, minimize combinable quasi-identifiers, account for thousands of potential recipients

Use Cases from FPv2 Supplement

OPA requires the collection of family planning service delivery data in the form of the FPAR as a condition of its grant awards. The office uses the data for program planning and budgeting, monitoring program performance, clinical quality improvement, budget justification to Congress, and allocation of funding to address unmet need for family planning services in specific areas of the U.S. and its territories.

The IHE QRPH Family Planning version 2 (FPv2) supplement defines five use cases for family planning data collection and reporting:

Use Case #1: FP Manual Data Entry - Manual data entry into family planning forms at service sites
Use Case #2: FP with Pre-pop Option - Forms pre-populated with existing patient data from clinical systems
Use Case #3: FP with Pre-pop Option with Supplemental Data - Pre-populated forms supplemented with additional clinical data sources
Use Case #4: Forms Data Capture with Document Submission - Structured data capture with subsequent document submission
Use Case #5: EHR FP Document Submission - Direct submission of family planning documents from EHR systems

All five use cases involve collection of family planning data from Title X grantees, sub-recipients, and service sites that provide a wide range of family planning and related preventive health services. The de-identification analysis in this document applies to all FPv2 use cases.

Important Scope Limitation: This analysis is specific to the Title X FPAR context in the United States. The conclusions regarding optimal de-identification techniques relate exclusively to this use case. Organizations wishing to utilize these data elements in other programs must conduct their own de-identification analysis, considering local needs and applicable legislation.

The identified data described in the FPv2 supplement is used for clinical purposes at the point of care. A de-identified data set is needed for reporting and performance measurement purposes. The de-identified data set is not intended to be suitable for general research purposes, as that would result in too broad and identifiable a data set. Data elements that may be useful for some research purposes may be redacted or segregated into separate reports to reduce risk to vulnerable patients.

For purposes of risk analysis and exposure of the de-identified data set, our assumptions include:

Data is collected by up to 4,100 service sites that comprise the Title X network
Data is de-identified by a single, central de-identification third party
Data is submitted in a de-identified manner to OPA
De-identified data is made available to authorized staff from OPA headquarters and staff from OPA’s family planning service grantees and their subrecipient agencies

The risk posture of this data set is not the same as making the data publicly available; however, with potential access numbering in the thousands, securing this data set is a significant challenge that must be considered during the de-identification process.

Architectural Context: Since the FPAR data will have already been used to provide treatment and services to the patient at the point of care, the de-identified data is not needed for clinical purposes. The de-identified data supports program evaluation, performance measurement, and policy decisions for the Title X network.

From an architectural perspective, the FPAR use case depends on de-identification being performed prior to submission to the host organization. De-identification could therefore be conducted by a third-party intermediary or performed at the source EHR. However, multiple points and levels of de-identification pose a risk to the accuracy and longitudinal consistency of the data; therefore, after public comment feedback, a single, centralized third-party de-identification architecture was agreed upon.

Element-level De-identification Rules

This section constitutes the core of the de-identification profile for Family Planning data. As specified in IHE Profile Development Guidance, it provides detailed treatment specifications for each data element, identifying what will be removed, modified, or preserved. This element-by-element specification enables implementers to apply consistent de-identification across the Title X network while understanding the rationale for each decision.

Analysis Methodology

Each element analysis follows a structured approach integrating concepts from the updated handbook:

Technique Selection: Choosing appropriate techniques from the Algorithms catalog:
- Suppression techniques (masking, local suppression)
- Pseudonymization techniques (irreversible hashing)
Risk Contribution: Assessing how the element contributes to re-identification risk (particularly for quasi-identifiers)
Threat Analysis: Considering specific attack scenarios and background knowledge that could enable re-identification
Usability Validation: Confirming the transformed element maintains sufficient utility for intended purposes

The following subsections analyze each data element in the IHE QRPH Family Planning Profile. For each element, the analysis documents the decision process leading to the selected technique, balancing the clinical/analytical requirements against privacy protection needs.

Element-level Rules

Table 2.6-1: Element-level De-identification Rules

Note: Title terminology is preserved from the source white paper. In this profile, these “algorithms” correspond to selected de-identification techniques.

Element	Patient Id type	De-Identification algorithm
Facility identifier	Indirect	Mapping table
Clinical Provider identifier	Indirect	Mapping table
Patient identifier	Direct	Mapping table
Visit Date	Indirect	Generalized to week of year plus indicator of visit order
Date of Birth	Indirect	Convert to age in years. For clients over 50, grouped and mapped to "over 50".
Administrative Sex	Indirect	For values of "Male" or "Female" forward the data unchanged. For Administrative Sex values of "other" change them to "Female"
Pregnancy History	Indirect	Redacted
Limited Language Proficiency	Indirect	Collapse all forms to Limited English Proficiency (LEP) TRUE or LEP FALSE.
Ethnicity	Indirect	Only the values "2186-5 Not Hispanic or Latino" or "2135-2 Hispanic or Latino" may be used. Any other input value must be converted to "2186-5 Not Hispanic or Latino".
Race	Indirect	Collapse to 5 OMB categories plus Other. For each county, establish which races are below the threshold of 50 people per county. For those races, group them into "Other"
Annual Household Income	Indirect	Convert to percentage of Federal Poverty Level (FPL)
Household Size	Data	Convert to percentage of Federal Poverty Level (FPL)
Visit Payer (U.S. Only)	Indirect	Convert to Public Health Information Network (PHIN) Vocabulary Access and Distribution System (VADS)
Current Pregnancy Status	Indirect	Generalize to YES/NO/UNKNOWN
Pregnancy Intention	Data	Unchanged
Sexual Activity	Data	Unchanged
Contraceptive Method at Intake	Data	Unchanged.
Reason for no contraceptive method	Data	Unchanged.
Contraceptive Method at Exit	Data	Unchanged.
Date of Last Pap test	Indirect	Redact the day of the month, and use Week and Year only in the format yyyyWww, where week 52 of 2014 would appear as 2014W52
HPV Co-test Ordered	Indirect	Redact the day of the month, and use Week and Year only in the format yyyyWww, where week 52 of 2014 would appear as 2014W52
CT Screen Ordered	Indirect	Redact the day of the month, and use Week and Year only in the format yyyyWww, where week 52 of 2014 would appear as 2014W52
GC Screen Ordered	Indirect	Redact the day of the month, and use Week and Year only in the format yyyyWww, where week 52 of 2014 would appear as 2014W52
HIV Screen Ordered	Indirect	Redact the day of the month, and use Week and Year only in the format yyyyWww, where week 52 of 2014 would appear as 2014W52
HIV Rapid Screen Result	Indirect	Delete. HIV reporting will be handled separately.
HIV Supplemental Result	Indirect	Delete. HIV reporting will be handled separately.
Referral Recommended Date	Indirect	Delete. HIV reporting will be handled separately.
Referral Visit Completed Date	Indirect	Delete HIV referrals. HIV reporting is required for the HHS HIV linkage to care performance measure; however, HIV data is sensitive and the HIV pools are sufficiently small that a separate mechanism will be established for reporting on these data, such as reporting these values to a separate aggregate database. For non-HIV referrals, redact the day of the month and use Month and Year only
Systolic blood pressure	Data	Unchanged
Diastolic blood pressure	Data	Unchanged
Height	Indirect	Unchanged, except for values below 59 inches or above 76 inches. For values below 59 inches, convert to 59 inches. For values above 76 inches, convert to 76 inches
Weight	Indirect	Unchanged, except for values below 100 lbs. or above 299 lbs. For values below 100 lbs., convert to 100 lbs. For values above 299 lbs., convert to 299 lbs.
Smoking status	Indirect	Unchanged
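Several of the rules in the table reduce to small pure functions. The sketch below illustrates four of them under stated assumptions: the table does not say which week-numbering scheme applies, so ISO 8601 weeks are assumed; the reference date for computing age is left to the caller; and all function names are hypothetical.

```python
from datetime import date

ALLOWED_ETHNICITY = {"2135-2", "2186-5"}  # Hispanic or Latino / Not Hispanic or Latino


def to_week(d: date) -> str:
    """Generalize a date to yyyyWww (ISO 8601 week numbering assumed)."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}W{iso_week:02d}"


def age_group(dob: date, as_of: date) -> str:
    """Convert date of birth to age in years; group ages above 50."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    age = as_of.year - dob.year - (0 if had_birthday else 1)
    return "over 50" if age > 50 else str(age)


def clamp(value: float, low: float, high: float) -> float:
    """Replace out-of-range values with the nearest bound."""
    return max(low, min(high, value))


def height_inches(value: float) -> float:
    return clamp(value, 59, 76)


def weight_lbs(value: float) -> float:
    return clamp(value, 100, 299)


def ethnicity_code(code: str) -> str:
    """Pass the two permitted codes through; convert anything else to 2186-5."""
    return code if code in ALLOWED_ETHNICITY else "2186-5"


pap_week = to_week(date(2014, 12, 24))            # a date in week 52 of 2014
older_client = age_group(date(1960, 1, 1), date(2021, 6, 1))
tall_client = height_inches(80)
```

Under ISO numbering the week's year can differ from the calendar year near year boundaries (29 December 2014 falls in 2015W01), so an implementation would need to confirm the intended convention before relying on this.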
