Requirements Federated Learning and mUlti-party computation Techniques for prostatE cancer
0.1.0 - ci-build
Funded by the European Union

Requirements Federated Learning and mUlti-party computation Techniques for prostatE cancer, published by HL7 Europe. This guide is not an authorized publication; it is the continuous build for version 0.1.0 built by the FHIR (HL7® FHIR® Standard) CI Build. This version is based on the current content of https://github.com/hl7-eu/flute-requirements/ and changes regularly. See the Directory of published versions

5.2 Methodology and Study

5.2.1 Methodology

FLUTE specific requirements are outlined based on discussions with stakeholders participating in the case studies, in particular representatives of technical partners participating in WP5 (Quibim) and medical researchers from the three participating hospitals CHUL, IRST and VHIR, which also act as data owners. Data owners are aware of ICT aspects, data protection and other requirements for the data owner nodes. The part also includes GDPR considerations for FLUTE pilot studies and the platform.

5.2.2 Primary objective

  • To validate the Barcelona Risk Calculator models (BCN-RC 1 and 2) with the inclusion of imaging biomarkers extracted from mpMRI/bpMRI for the detection of csPCa with data from multiple regions, including Barcelona (Spain), Emilia-Romagna (Italy) and Liège (Belgium).

5.2.3 Secondary objectives

  • Achieve robust performance in csPCa cancer detection, considering variations in the prevalence of the disease in different regions of Europe.
  • Develop and validate a cross-border federated AI solution on the FLUTE platform for the diagnosis of csPCa in different regions of Europe.

5.2.4 Study design

The study is an analytical observational, retrospective, and multicenter study based on 3 different European institutions: Hospital Universitari Vall d’Hebrón (HUVH, Spain), Instituto Romagnolo per lo Study dei Tumori (IRST, Italy) and Centre Hospitalier Universitaire de Liège (CHU, Belgium). The retrospective cohorts consist of men with clinical suspicion of PCa based on a PSA > 3.0 ng/ml and/or abnormal DRE, in whom the 7 clinical variables used in the Barcelona Predictive Model of csPCa are retrospectively collected. Each patient underwent a mpMRI/bpMRI reported with the Prostate Imaging-Report and Data System (PI-RADS), and a subsequent systematic and/or targeted (aimed at PI-RADS ≥3 lesions) prostate biopsy during the first year after the MRI.

From each MRI, a prospectively radiomics analysis is performed to extract the quantitative imaging biomarkers from the automatic segmentation of anatomic prostate regions/suspicious lesions with QP-Prostate, an AI-based software developed by QUIBIM.

The imaging biomarkers will be added to the 7 clinical variables of BCN-RC to create a predictive model for the prediction of csPCA in patients with clinical suspicion of PCa

5.2.5 Study population

Number of subjects:

  • HUVH-VHIR: 1963 patients
  • IRST: 400 patients
  • CHUL: 700 patients

Patient inclusion criteria

Men with clinical suspicion of PCa based on a PSA > 3.0 ng/ml and/or abnormal DRE, in whom a mpMRI/bpMRI is performed and a subsequent systematic and/or targeted prostate biopsy is done during the first year following the MRI. Lesions detected in mpMRI/bpMRI have to be reported using the Prostate Imaging-Report and Data System (PI-RADS) in version 2.0 or higher. Prostate biopsies are systematic and targeted in cases of PI-RADS ≥3 lesions.

Patient exclusion criteria

Patients without MRI images prior to the biopsy or with images obtained earlier than one year before the biopsy

List of variables The variables to be extracted from each cohort and to be included in the model are defined as follows: Endpoint variable: csPCa, defined as a PCa in prostate biopsy with an International Society of Urologic Pathology (ISUP) grade group (GG) 2 or higher. Independent variables:

  • 7 clinical variables (described below).
  • Quantitative imaging biomarkers extracted from the radiomic analysis of the mpMRI/bpMRI.

Types of variables that will be included in the model:

  • Clinical variables
Variable Description Type of data Data format Source system
AGE Age at the biopsy numeric integer Clinical History
FH PCa family history categorical 0: No; 1: Yes Clinical History
TB Type of biopsy categorical 0: initial; 2: repeated Clinical History/Procedure report
PSA PSA numeric Numeric with 1 decimal (ng/ml) Clinical History/Lab data
DRE Rectal examination categorical 0: normal; 1: suspicious Clinical History
VP MRI-prostate volume numeric Numeric with 1 decimal (cc) MRI report
PIRADS PI-RADS v.2.0 or 2.1 categorical 1 to 5 MRI report
  • MRI images
    1. Biparametric: T2W, DWI
    2. Multiparametric MRI: T2W, DWI, DCE (optional)
  • MRI biomarkers The QP-Prostate® tool will be used to process MRI images and extract biomarkers that will be used as input in the prediction algorithms. The QP-Prostate® analysis is based on T2-weighted magnetic resonance (MR) and the diffusion-weighted image (DWI). In case the dynamic contrast-enhanced (DCE) sequences are acquired, those sequences should be included and perfusion features from the image will be obtained.

For a successful launch of QP-Prostate®, the study must include the T2-weighted MR sequence and the DWI. DCE sequences are optional, but should be uploaded in case a multiparametric study has been acquired.

The analysis inclusion criteria for the T2W, DWI and DCE sequences are based on PI-RADS® v2.1 recommendations. However, QP-Prostate® software is also able to analyzed cases not compliant with PI-RADS ® v2.1, but should follow the acceptable criteria described in the Annex “Requirements QP-Prostate.pdf”. In the mentioned Annex, the requirements for running QP-Prostate® are described.

5.2.6 Experimental protocol

The variables detailed above will be extracted from EHR and PACS at CHUL, HUVH and IRST. The final databases will be stored in the FLUTE local nodes installed at each clinical site. Each FLUTE node can participate in the federated FLUTE research network. The FLUTE platform uses cryptographic methods like homomorphic encryption, SMPC, TEE and differential privacy to aggregate statistics about the cohort without leaking sensitive data outside the local node. Iteratively, the FLUTE node can train machine learning algorithms to fit the given datasets.

Within the FLUTE platform, each study is associated to a privacy budget. The FLUTE platform will ensure that the privacy budget is not exhausted. The FLUTE platform provides a graphical use interface to researchers to: a) discover and select the datasets registered at each FLUTE node; b) obtain descriptive statistics about the datasets; c) fit statistical and AI models to the datasets. The later part is provided using Jupyter notebooks integrated with the FLUTE functionalities.

First phase The first iteration will

  • Assess feasibility of use case, identify data available and variables, extract data
  • Data complies with data requirements, patient requirements, ethics approvals
  • Install QUIBIM platform
  • Platform complies with privacy and security measures for data privacy
  • Data is processed from the hospital database to the platform and back
  • Accesses and permissions are restricted to user roles
  • The automatic segmentation of lesions/anatomical zones will be validated by experienced radiologists at each center. A batch of 300 cases is estimated to be enough to assure a proper validation for the rest of the cases.
  • Install FLUTE platform
  • Identify images corresponding to the patient cases identified and send them to the analysis platform (QP-Prostate)
  • Development of synthetic data from the cases identified
    • Images and data generated mimic real data
    • If images are in a human readable format, radiologists will validate as many as the real images

Second phase The second iteration will be the development of the model(s)

  • The model(s) will determine the patient outcome (csPCa or not) using clinical data and MRI quantitative biomarkers. The model(s) will improve the accuracy of existing models BCN1 and 2.
  • The model(s)will be able to predict the risk of csPCa without the need for invasive procedures such as biopsy.
  • The model(s) will also be validated based on the different categories of PI-RADS.
  • The model(s)will be fine-tuned depending on the regional requirements of each hospital.

5.2.7 Data flow

The project aims to build a local data node at the data custodian premises. This node will be populated with data coming from different IT sub-systems of the data custodian or external (EHR, PACS, QP-Prostate, etc.). To demonstrate the capability of the platform to train a model on a specific cohort, a superset of the clinical cohort will be uploaded. E.g., Reduced sample datasets with the 7 clinical variables and associated images will be uploaded to the FLUTE node.

GDPR compliance requirements for case studies General Data Protection Regulation (GDPR) regulates the use of personal data and provides for specific requirements for fair and lawful processing of such data. The case studies will require collection and use of retrospective medical data provided by Hospital Universitari Vall d’Hebrón (VHIR, Spain), Instituto Romagnolo per lo Study dei Tumori (IRST, Italy) and Centre Hospitalier Universitaire de Liège (CHU, Belgium). Data relating to health is considered personal data under GDPR and requires an elevated level of data protection as special category of data (often referred to as ‘sensitive data’). Below we provide an overview of the legal considerations for the use of such data in the case studies.

The GDPR empowers Member States to impose derogations and exceptions in respect of GDPR obligations for particular processing activities. This allowance is extended to processing for scientific research purposes. Article 89 of the GDPR governs processing for scientific, historical or statistical purposes. Data Controllers are permitted to process data for these specific purposes where appropriate safeguards are implemented in accordance with Article 89(1). In this specific context, Article 89(2) of the GDPR allows Member States to establish further derogations from data subjects right referred to in Article 15, 16, 18 and 21, hence the legal basis for the processing will depend on the particular Data Provider:

  • VHIR will collect and contribute retrospective data i.e. data which was previously collected and is already available at the hospital. Patients have signed a general consent that allows data re-use for research purposes, the purposes always evaluated in the Ethics Committee. The Data will be pseudonymized at the IT department of the Vall d’Hebron Hospital (IT-HUVH). A table of correspondences with the original patient ID and a newly random generated ID will be created. The table will not be available to the research team or any other consortium partners. Link between images and clinical data will be provided.
  • CHUL will collect and contribute retrospective data based on Article 6.1.c in connection with Article 9.2j GDPR. CHUL will observe requirements of Article 89 §1 of the GDPR and Article 197 of the Belgian Law 30 July 2018 and will contribute pseudonymized data. CHUL will generate a pseudonym and attach it to each record as a metadata. The pseudonym is an irreversible hash of the patient ID in our EHR (using SHA-256 algorithm). The pseudonym is specific to the FLUTE project. Only CHU Liège authorized staff has access to table linking back the pseudonym to the patient’s identity.
  • IRST will collect and contribute retrospective data of cancer patients. Dataset will be minimized i.e. detailed analysis based on the use case will be performed and minimization tasks will be applied to reduce the amount of data per patients and, eventually, to discretize data that don’t need continuous values. IRST will contribute anonymized data and will only share it in a federated way. If pseudonymized cases are needed, selected patients may be approached for specific consent to use their data. All data will be reviewed before utilization.

5.2.9 Data minimization

In accordance with Article 89 §1 of the GDPR (echoed by Article 197 of the Belgian Law 30 July 2018 on the Protection of Individuals with regard to the Processing of Personal Data) the data controller using personal data for scientific research purposes must implement safeguards, which ensure technical and organizational measures to ensure the respect for data minimization. In particular in the following ways:

  • Using anonymous data if such data will achieve the purpose of the research;
  • When anonymous data processing does not achieve the research purpose, use pseudonymized data;
  • When pseudonymized data processing does not achieve the research purpose, use non-pseudonymized data.

In relation to the FLUTE project, for data contributed by CHUL and VHIR, the research objectives cannot be achieved using anonymous data. For CHUL, it is required to keep the code to allow patients to exercise their right to object and to troubleshoot ETL processes from the different hospital IT systems involved. For VHIR, anonymous data would be insufficient because images will be processed outside the hospital by QUIBIM and afterwards the results obtained need to be linked to the clinical records. Therefore, personal identifiers are coded when data is extracted from the hospital clinical IT system into the FLUTE platform.

For IRST, data will be predominantly anonymized, subject to few example cases which may need to be pseudonymized. In the latter case, patient consent will be obtained. The following activities could be performed at IRST iteratively:

  • A data quality report will be executed to evaluate “raw the quality and completeness of the extracted data”.
  • A data cleaning process will be performed to increase the quality and completeness of data, if needed, on the basis of the quality report and evaluations from researchers.
  • A data minimization process based on pre-defined criteria will be applied to the evaluate records and variables/attributes that are not usable for the specific study objectives or attributes that don’t add any informative value.
  • A final data anonymization process will be performed to reduce the risk of subject re-identification. All attributes will be categorized in:
    • Direct identifying attributes (DI): explicit links to data subject (NHS number, national insurance number, biometric residence permit number, and email address)
    • Quasi identifying attributes (QID): a set of attributes that, in combination, can be linked with external information to reidentify the subject (gender, date of birth, postcode, and ethnic origin)
    • Sensitive attributes (SA): not useful with respect to the determination of the patient’s identity, they contain sensitive health-related information about patients (pathology, clinical drug dosage)
    • Other attributes (OA): represent variables that are not considered sensitive and would be difficult for an adversary to use for re identification

Moreover, the FLUTE project starts from the idea that sensitive data must not leave the premises of the data owners (hospitals). Thus, once the data is prepared by the hospital, it resides on a server (called ‘data owner node’) protected by the hospital’s own infrastructure. This server via FLUTE platform then exchanges encrypted messages with other data owners to collaboratively compute aggregates (e.g., averages of attributes or gradients, or other statistics) in such a way that under the security assumptions: (i) no sensitive information can be revealed from these exchanged messages;(ii) only a privatized version of the aggregate/model/statistic/etc. the data owners agreed in advance to compute can be revealed.

The specific exceptions to this rule, required to achieve the purposes of the project, shall be defined by the project.

5.2.10 Data processing agreements

The Consortium Agreement of the FLUTE project outlines the rights and obligation of the partners. A data protection impact analysis will be conducted with the hospitals’ DPOs to identify whether additional agreements are needed for data processing. Supporting data sharing agreements will be drafted and executed prior to data sharing.

GDPR impact on the functional requirements of the Platform In this section we build on the legal requirements which identified in Section 5.3 of the Trumpet deliverable 1.1. and provide additional comments translating the specified principles into functional requirements applicable to the Platform.

Article 5 of the GDPR provides for accountability by stipulating that the controller shall be responsible for, and be able to demonstrate compliance with the data protection principles of “lawfulness, fairness and transparency”, “purpose limitation”, “data minimization”, “accuracy”, “storage limitation” and security (“integrity and confidentiality”)”. Accountability means that entities responsible for the processing of data must be identified, and that appropriate controls (such as logs) are available to ensure that any problems can be attributed to the correct entity.

According to the principle of integrity and confidentiality (also referred to as “data security”, as elaborated in Article 32 GDPR), data must be protected by appropriate technical and organizational measures to ensure its confidentiality, integrity and availability. Data security refers to the layer of security in an information system that is devoted to adding protections to the data in the system itself and controlling the access to the data through identity and access management. FLUTE needs to implement security requirements that are appropriate to the sensitivity of the information at and that take into account the system and its components as well as the data accessible through the system and the identified risks. In particular, the risks may include:

  • Breach of confidentiality – unauthorized disclosure of information that an individual or organization wishes to keep confidential.
  • Breach of individual privacy and autonomy – access to and use of an individual’s health-related data without the appropriate knowledge or consent of the individual concerned, or for purposes the individual has not authorized.
  • Malicious or accidental corruption or destruction of health-related data
  • Disruption in availability of data and services necessary to maintain appropriate access to health-related data.
  • Unethical, illegal, or inappropriate actions that attempt to breach security controls, surreptitiously obtain or derive information in an unauthorized manner, or otherwise undermine the trust of FLUTE platform.

Given the above risks, in the context of FLUTE, examples of the measures may include:

  • Authentication and authorization processes, as well as approval procedures including granting access rights to a system by the system owner. All processing activities undertaken within FLUTE must be subject to authentication and authorization processes in a way that allows the patient data to be protected against access or use by unauthorized parties.
  • Data should be de-identified and encrypted, when possible.
  • Personal data should only be transferred via secure online channels. This can be achieved via trusted networks, using a channel where data is encrypted or equivalent means.
  • All processing activities undertaken within FLUTE must be subject to logging processes in a way that allows access, modification and deletion of the data to be detected and audited.
  • All processing activities undertaken within FLUTE must be subject to appropriate storage and retention controls, ensuring that all data is stored in a manner that renders them accessible only under the control of relevant partners.
  • If a breach of personal data in the FLUTE project occurs (i.e. a breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to, personal data), the FLUTE Project Coordinators must be informed by the affected FLUTE partner(s) without undue delay after becoming aware of it, so that appropriate follow-up actions can be undertaken.
FLUTE_validationPhase1
Figure 2: FLUTE validation phase 1
FLUTE_validationPhase2
Figure 3: FLUTE validation phase 2