Requirements Federated Learning and mUlti-party computation Techniques for prostatE cancer
0.1.0 - ci-build
Funded by the European Union


4.3 Queries to be supported by the researcher/innovator interface

4.3.1 Requirements induced by the synthetic data generation objectives

FLUTE’s Project Objective 3 (PO3) calls for researching and developing novel methods for the generation of multi-modal synthetic health data. Using multimodal EHR data from hospitals, biopsy images, multiparametric MRI, and measured variables from health data lakes, we will develop novel methods of synthetic multi-modal health data generation. Going beyond current methods based on Generative Adversarial Networks (GANs), which generate synthetic data by converting real-world data into 2-D objects, FLUTE will develop new methods that employ autoencoder structures, such as Variational Auto-Encoders (VAEs), and architectures based on Diffusion models. Data exists in different formats, e.g., tabular (categorical, date, ordinal and continuous data) and biparametric (bp) and multiparametric (mp) MRI images. When generating synthetic data, the user should be able to specify (a sketch of such a request follows the list below):

  • The type of data (attributes, images …) that should be generated together, e.g., “MRI image, age and diagnosis”.
  • Which subset of the distribution should be generated, e.g., “only patients with diagnosed cancer” or “only patients above a given age”.
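
As an illustration only, the Python sketch below shows one way such a request could be expressed; the field names (modalities, cohort_filter, n_samples) and values are hypothetical and do not correspond to any interface defined in this guide.

```python
# Hypothetical shape of a researcher-facing synthetic data request.
# All field names and values are illustrative assumptions.
synthetic_data_request = {
    # Which data types should be generated jointly
    "modalities": ["mpMRI", "age", "diagnosis"],
    # Which subset of the distribution to sample from
    "cohort_filter": {
        "diagnosis": "prostate_cancer_confirmed",
        "min_age": 60,
    },
    # How many synthetic records to return
    "n_samples": 500,
}

def describe_request(request: dict) -> str:
    """Return a human-readable summary of a synthetic data request."""
    modalities = ", ".join(request["modalities"])
    filters = "; ".join(f"{k}={v}" for k, v in request["cohort_filter"].items())
    return f"Generate {request['n_samples']} samples of ({modalities}) where {filters}"

print(describe_request(synthetic_data_request))
```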

EHRs and parameters extracted from MRI are usually converted into 2-D objects that can be processed by Generative Adversarial Networks (GANs) to generate synthetic data, e.g., GANs developed specifically for cancer patients, or medGAN and MC-medGAN for EHR-based secure federated AI model development. In FLUTE, these will be extended to generate novel multi-modal information. Moreover, novel approaches using autoencoder structures will be explored by constructing multi-modal datasets in the form of 1-D objects, as sketched below.
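
The sketch below, built on toy data and placeholder attribute names, illustrates this encoding idea: mixed-type EHR attributes become fixed-length 1-D vectors, which can optionally be padded and reshaped into small 2-D objects. It is not the FLUTE pipeline.

```python
# Illustrative encoding of mixed-type EHR records into 1-D and 2-D numeric objects.
import numpy as np
import pandas as pd

# Toy tabular EHR with continuous, ordinal and categorical attributes (placeholder names).
ehr = pd.DataFrame({
    "age": [67, 72, 58],
    "psa_ng_ml": [4.1, 9.8, 2.6],                  # continuous biomarker
    "gleason_grade_group": [2, 4, 1],              # ordinal
    "diagnosis": ["csPCa", "csPCa", "no_csPCa"],   # categorical
})

# One-hot encode categoricals and keep numeric columns: one 1-D vector per patient.
encoded = pd.get_dummies(ehr, columns=["diagnosis"]).astype(float)
vectors_1d = encoded.to_numpy()                    # shape: (patients, features)

# Optionally zero-pad and reshape each vector into a square 2-D object for image-style GANs.
side = int(np.ceil(np.sqrt(vectors_1d.shape[1])))
padded = np.zeros((vectors_1d.shape[0], side * side))
padded[:, :vectors_1d.shape[1]] = vectors_1d
objects_2d = padded.reshape(-1, side, side)        # shape: (patients, side, side)

print(vectors_1d.shape, objects_2d.shape)
```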

For bpMRI and mpMRI, existing models are limited to the generation of 2D slices or small patches due to memory constraints. In FLUTE, new models based on GANs and autoencoders will be developed to generate full 3D volumes. Two approaches will be considered: (a) synthesize a single parametric MRI and generate the other MRI modalities via image-to-image translation; (b) generate all modalities in a single stage while ensuring the consistency of the parametric information. Novel uses of synthetic data in FL will also be applied, building on recent work that treats the insertion of synthetic data into FL as a form of data augmentation, i.e., training local models on both local and synthetic global data.

Privacy risks and legislation hinder the availability of patient data to researchers. Synthetic data has been regarded as a privacy-safe alternative to real data and has lately been employed in many research and academic endeavors. The literature shows the effectiveness of synthetic datasets for different applications in research, academia, and testing according to existing statistical and task-based utility metrics. However, longitudinal synthetic data has received comparatively little attention, and a unified metric for generic quality assessment of synthetic data is still lacking.

In the domain of unbalanced or small databases, older algorithms, such as rotation or scaling, tried to replicate a data distribution. With the explosion of deep learning, the most successful approaches are generative models:

  • Variational Autoencoder (VAE)
  • Generative Adversarial Network (GAN)
  • Diffusion Models (DM)

A Variational Autoencoder is a type of likelihood-based generative model. It consists of an encoder, which takes data x as input and transforms it into a latent representation z, and a decoder, which takes a latent representation z and returns a reconstruction x’. Inference is performed via variational inference to approximate the posterior of the model.
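
For reference, the sketch below implements a minimal VAE in PyTorch for 1-D tabular inputs; the input dimension, layer sizes, latent dimension and Gaussian reconstruction loss are illustrative assumptions, not FLUTE design choices.

```python
# Minimal VAE sketch in PyTorch for a 1-D tabular input of dimension 32 (illustrative).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim: int = 32, z_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, z_dim)        # mean of q(z|x)
        self.to_logvar = nn.Linear(64, z_dim)    # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_rec = self.decoder(z)                                    # reconstruction x'
        return x_rec, mu, logvar

def elbo_loss(x, x_rec, mu, logvar):
    # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
    rec = ((x - x_rec) ** 2).sum(dim=1).mean()
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    return rec + kl

model = VAE()
x = torch.randn(16, 32)                          # toy batch of 16 records
x_rec, mu, logvar = model(x)
print(elbo_loss(x, x_rec, mu, logvar).item())
```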

A GAN uses two different networks: one that generates inputs (the generator) and another one (the discriminator) trained to distinguish generated inputs from actual ones. When the discriminator can no longer tell generated inputs from actual ones, the model is considered to have converged. GANs can be used for unsupervised learning and supervised learning, as well as for reinforcement learning. These structures converge poorly and require time-demanding training procedures. The common hyperparameters that are fixed during the training of the model are: learning rate; batch size; number of epochs; generator optimizer; discriminator optimizer; number of layers; number of units in a dense layer; activation function; loss function. Common coding platforms, such as Keras, provide tools like Keras Tuner for GAN hyperparameter tuning.

Diffusion models are the most recent and computationally demanding of the three proposed methods. Additionally, they are less well studied than the previous ones. Given their relative instability as learning machines, there is a substantial body of literature dedicated to tuning these systems.
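
Returning to GANs, the sketch below is a minimal PyTorch training loop on toy 1-D data that makes several of the hyperparameters listed above explicit (learning rate, batch size, number of epochs, separate generator and discriminator optimizers, loss function); the architecture and values are illustrative only.

```python
# Minimal GAN training loop sketch in PyTorch on toy 1-D data (illustrative hyperparameters).
import torch
import torch.nn as nn

x_dim, z_dim = 32, 8
learning_rate, batch_size, n_epochs = 2e-4, 64, 5         # fixed hyperparameters

generator = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
discriminator = nn.Sequential(nn.Linear(x_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=learning_rate)   # generator optimizer
d_opt = torch.optim.Adam(discriminator.parameters(), lr=learning_rate)  # discriminator optimizer
bce = nn.BCEWithLogitsLoss()                               # loss function

real_data = torch.randn(1024, x_dim)                       # stand-in for real records

for epoch in range(n_epochs):
    for i in range(0, len(real_data), batch_size):
        real = real_data[i:i + batch_size]
        ones, zeros = torch.ones(len(real), 1), torch.zeros(len(real), 1)

        # Discriminator step: tell real records from generated ones.
        fake = generator(torch.randn(len(real), z_dim)).detach()
        d_loss = bce(discriminator(real), ones) + bce(discriminator(fake), zeros)
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: try to fool the discriminator.
        fake = generator(torch.randn(len(real), z_dim))
        g_loss = bce(discriminator(fake), ones)
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    print(f"epoch {epoch}: d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
```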

4.3.2 Requirements induced by the objectives on predictive modeling for diagnosis

In the framework of FLUTE’s csPCa prediction algorithm development, the training of composite AI models will be considered. That is, complex algorithms working with three different types of input data (clinical data, biomarkers and MRI images) will be combined to make the final decision. The development of the composite algorithms will be based on the state-of-the-art study being carried out in WP2 and also on the conditions derived from the work of WP5.

An overview of the Machine Learning (ML) models/algorithms that will be considered is given below. It is important to note that the final choice of models and algorithms may depend on the specific data available, the requirements, and ongoing research. As previously indicated, we will probably deal with data of different kinds: clinical data (probably raw or tabular data), biomarkers and MRI images. Therefore, ideally, dedicated algorithms will be used for image processing, and other types of algorithms will be used to deal with data in other formats.

Some potential options for non-image data are listed below (a combined sketch of these models follows the list):

  • Logistic Regression (LR) is a widely used linear classification algorithm that models the probability of a binary outcome based on one or more predictor variables. Unlike Linear Regression, which is used to predict numerical values, logistic regression uses the logistic function to transform a linear combination of predictor variables into a value between 0 and 1, which represents the probability of the event occurring. LR input features could be clinical data and biomarkers, and the output is a probability score between 0 and 1. Training logistic regression models is relatively fast and computationally efficient, especially for medical datasets. The model size is usually small.
  • Decision Trees (DT) and Random Forests (RF). Decision Trees are non-linear models that can capture complex relationships in data. They are tree structures that divide a data set into smaller, more manageable subsets based on specific characteristics of the data, with the objective of making predictions. RFs are ensembles of DTs. In this type of algorithm, input features could be similar to those of LR, and the output is a binary decision (e.g. biopsy needed or not). DT training is fast, and RF provides a good balance between accuracy and complexity. Model sizes are reasonable.
  • Support Vector Machines (SVM) is a binary classification algorithm that searches for the hyperplane that best separates data points. SVM input features could be similar to those of the LR, DT and RF models, and the output is a binary classification decision. SVM training can be computationally expensive for large datasets, and the model size is relatively small.
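
As a rough illustration of how these non-image options could be trained, the sketch below fits LR, RF and SVM classifiers with scikit-learn on a synthetic stand-in for clinical data and biomarkers; the dataset, features and settings are placeholders.

```python
# Fitting the three non-image model families on a toy tabular dataset with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for clinical data + biomarkers with a binary "biopsy needed" label.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression()),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```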

For MRI image processing, we will study the use of Deep Neural Networks (DNN). DNNs, including Convolutional Neural Networks (CNNs) for image data, are a type of ML algorithm that is based on the workings of the human brain and can capture complex patterns in data through multiple layers. DNNs are a subcategory of Artificial Neural Networks and are called “deep” because they have many hidden layers between the input layer and the output layer. Each hidden layer performs increasingly abstract feature transformations and extractions as the data propagates through the network.

As mentioned above, the input data for these networks could include medical images such as MRI scans. The output can be a probability score or a binary decision. These algorithms can also be used to extract certain characteristics of the images. It is worth mentioning that training DNNs can be computationally expensive, requiring a significant number of iterations and potentially large model sizes.
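
By way of illustration, the sketch below defines a small CNN in PyTorch that maps a single-channel 2-D MRI slice to a probability score; the input size, channel counts and layer structure are assumptions made only for the example.

```python
# Minimal CNN sketch: a single-channel MRI slice is mapped to a probability score.
import torch
import torch.nn as nn

class SliceClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(               # convolutional feature extractor
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(                   # maps extracted features to a single logit
            nn.Flatten(), nn.Linear(16 * 16 * 16, 1),
        )

    def forward(self, x):
        feats = self.features(x)
        return torch.sigmoid(self.head(feats))       # probability score in [0, 1]

model = SliceClassifier()
batch = torch.randn(4, 1, 64, 64)                    # 4 toy 64x64 MRI slices
print(model(batch).squeeze(1))                       # one probability per slice
```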

We will study the possibility of, a priori, using certain features extracted from the MRI images together with the clinical data and biomarkers to predict csPCa and to decide whether a biopsy is needed. As previously mentioned, the final development will depend on what has been studied in the state of the art and also on the possibilities offered by the final input data.
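
A minimal sketch of this fusion idea, assuming a toy convolutional feature extractor and placeholder clinical/biomarker features, might look as follows; the architecture is illustrative and not the planned FLUTE model.

```python
# Sketch of fusing image-derived features with clinical data and biomarkers
# before a final biopsy-decision classifier (illustrative architecture only).
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, n_clinical: int = 6):
        super().__init__()
        self.image_features = nn.Sequential(          # toy image feature extractor
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Sequential(              # decides on the fused feature vector
            nn.Linear(8 + n_clinical, 16), nn.ReLU(), nn.Linear(16, 1),
        )

    def forward(self, mri, clinical):
        fused = torch.cat([self.image_features(mri), clinical], dim=1)
        return torch.sigmoid(self.classifier(fused))  # probability that a biopsy is needed

model = FusionModel()
mri = torch.randn(4, 1, 64, 64)                       # toy MRI slices
clinical = torch.randn(4, 6)                          # toy clinical data + biomarkers
print(model(mri, clinical).squeeze(1))
```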

The use of Meta-Algorithms is also being considered. These are ML algorithms that are used to combine, adjust or manipulate other learning algorithms to improve the performance and accuracy of the final model. In the case of FLUTE, to enable retraining on data from different hospitals and help tune hyperparameters, we could consider Bayesian Optimization or Grid Search.
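
As an example of the Grid Search option, the sketch below tunes a Random Forest with scikit-learn's GridSearchCV on synthetic data; the estimator, parameter grid and dataset are placeholders rather than the FLUTE configuration.

```python
# Hyperparameter tuning via grid search with cross-validation (placeholder setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],       # candidate forest sizes
    "max_depth": [3, 5, None],            # candidate tree depths
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```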