Requirements Federated Learning and mUlti-party computation Techniques for prostatE cancer

3.2 Main Categories

This section provides a set of data and algorithmic requirements for both the researchers and the innovators. These requirements fall into four main categories:

  • Tabular data
  • Images
  • Algorithms
  • Evaluation and validation

3.2.1 Tabular Data

When developing AI algorithms, including generative models, it is crucial to know exactly what the input data looks like. The format of the data must be known so that the algorithm can read it; its configuration must be fixed so that the number of columns and their characteristics (numeric, categorical, or sometimes both) are known and the data can be processed and normalized; and the data must be statistically valid, since corrupted input data must be avoided. This is also important when a user wants to add new data before the training process.
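As an illustration only, a minimal Python sketch of such input checks might look as follows; the column names and characteristics used here are hypothetical assumptions, not part of these requirements.

```python
# A minimal sketch (not the project's actual tooling) of checking that an
# input tabular dataset matches a fixed, expected configuration before training.
# Column names and types below are purely illustrative assumptions.
import pandas as pd

EXPECTED_SCHEMA = {            # hypothetical expected configuration
    "age": "numeric",
    "psa_level": "numeric",
    "gleason_score": "categorical",
}

def validate_table(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the data passed the checks."""
    problems = []
    # 1. The configuration (number and names of columns) must be fixed and known.
    if list(df.columns) != list(EXPECTED_SCHEMA):
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    # 2. Each column must have the expected characteristic (numeric / categorical).
    for col, kind in EXPECTED_SCHEMA.items():
        if kind == "numeric" and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"column '{col}' should be numeric")
    # 3. Basic statistical validity: no corrupted (missing) values.
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values found")
    return problems
```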

Moreover, it is vital to know the configuration and characteristics of the expected output so that it can be evaluated. Usually, when training a generative model to create synthetic data, the output configuration should be identical to the configuration of the input.
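A small sketch of comparing the output configuration against the input configuration, assuming both are pandas DataFrames (an illustrative choice, not mandated here):

```python
# A minimal sketch of checking that synthetic (output) data has the same
# configuration as the real (input) data, as described above.
import pandas as pd

def same_configuration(real: pd.DataFrame, synthetic: pd.DataFrame) -> bool:
    """True if the synthetic table has the same columns and dtypes as the real one."""
    return (
        list(real.columns) == list(synthetic.columns)
        and list(real.dtypes.astype(str)) == list(synthetic.dtypes.astype(str))
    )
```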

3.2.2 Images

The format of the input image data must be known, as well as whether and how the data has been normalized. The size (number of pixels along each axis) must also be fixed, and the resolution and contrast should preferably be roughly the same for all images, although, given sufficient and diverse data, this is not strictly necessary.
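A minimal sketch of such image checks, assuming NumPy arrays, a hypothetical fixed size, and a [0, 1] normalization range:

```python
# A minimal sketch of checking image size and value range before training.
# The expected shape and normalization bounds are assumptions for illustration.
import numpy as np

EXPECTED_SHAPE = (256, 256)   # hypothetical fixed number of pixels per axis

def check_image(img: np.ndarray, normalized: bool = True) -> list[str]:
    problems = []
    if img.shape[:2] != EXPECTED_SHAPE:
        problems.append(f"unexpected size {img.shape[:2]}, expected {EXPECTED_SHAPE}")
    if normalized and (img.min() < 0.0 or img.max() > 1.0):
        problems.append("image values fall outside the expected [0, 1] range")
    return problems
```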

3.2.3 Algorithms

Training a model is a potentially time-consuming procedure. Simple models, or models trained on small data sets, usually train in seconds or minutes, but training more complex models on big data sets might take hours, days, or even weeks in some cases. Moreover, generating synthetic data after the model is trained may also take time; depending on the resolution, the number of channels, and the algorithm itself, it could take as much as a few seconds per generated image. Therefore, the creation of hundreds or thousands of synthetic images could take minutes or even hours.
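As a rough illustration, the per-image generation cost can be probed and extrapolated; the model.generate() call below is hypothetical:

```python
# A rough sketch of estimating total generation time from the per-image cost,
# assuming a trained generator exposes a hypothetical `generate()` method.
import time

def estimate_total_seconds(model, n_images: int, n_probe: int = 5) -> float:
    """Time a few generations and extrapolate to the requested number of images."""
    start = time.perf_counter()
    for _ in range(n_probe):
        model.generate()                      # hypothetical single-image generation call
    per_image = (time.perf_counter() - start) / n_probe
    return per_image * n_images               # e.g. 2 s/image * 1000 images is roughly 33 minutes
```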

Generative models are usually more complex than most other models, and there can be significant differences between generative algorithms. For instance, a Variational Autoencoder (VAE) will usually take less time to train than a diffusion model, at the price of lower synthetic image quality. Training a GAN is relatively fast and the image quality is better than that produced by a VAE, but GANs are more challenging when it comes to setting hyperparameters and have a tendency towards so-called “mode collapse”, in which the model learns to generate only a specific type of image.

State-of-the-art generative models, such as Stable Diffusion and DALL-E, are based on diffusion models. This algorithm takes longer to train than VAEs and GANs, since it treats each pixel separately and the size of the latent space is the same as the size of the input image, in contrast to VAEs and GANs. The procedure for generating a synthetic image is also longer, since it is iterative and therefore time-consuming, again unlike VAEs and GANs, where this stage only requires sampling from a known distribution and executing one forward pass through the network.
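The difference in generation procedure can be sketched schematically as follows, with hypothetical decoder and denoiser networks standing in for real models:

```python
# A schematic sketch contrasting one-pass sampling in a VAE/GAN with the
# iterative denoising loop of a diffusion model. `decoder` and `denoiser`
# are hypothetical callables, not real library APIs.
import torch

def sample_vae_or_gan(decoder, latent_dim: int = 128):
    z = torch.randn(1, latent_dim)        # sample from a known distribution
    return decoder(z)                     # one forward pass produces the image

def sample_diffusion(denoiser, steps: int = 1000, shape=(1, 1, 256, 256)):
    x = torch.randn(shape)                # latent has the same size as the image
    for t in reversed(range(steps)):      # iterative denoising, one pass per step
        x = denoiser(x, t)                # hypothetical single denoising step
    return x
```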

Another issue in this category is dealing with cases where the training process fails. For instance, choosing unsuitable hyperparameters could cause the loss function being optimized during training to diverge. This is more likely to happen with GANs, but it is relevant to all algorithms.
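A minimal sketch of guarding a training loop against a diverging or invalid loss, where train_one_epoch is a hypothetical callable returning the epoch loss:

```python
# A minimal sketch of detecting a diverging training run, e.g. caused by
# unsuitable hyperparameters. `train_one_epoch` is a hypothetical callable.
import math

def train_with_divergence_check(train_one_epoch, n_epochs: int, factor: float = 10.0):
    best = float("inf")
    for epoch in range(n_epochs):
        loss = train_one_epoch()
        # Stop if the loss becomes NaN/inf or grows far beyond the best value seen.
        if math.isnan(loss) or math.isinf(loss) or loss > factor * best:
            raise RuntimeError(f"training diverged at epoch {epoch} (loss={loss})")
        best = min(best, loss)
```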

3.2.4 Evaluation and validation

The processes of evaluating and validating tabular data and images can be quite different, and therefore this category is subdivided into two sub-categories:

  • Image data: Training a generative model will produce synthetic data, but there is no guarantee of its quality. Moreover, the definition of “quality” may vary: in some projects quality might be defined as realism, in others as diversity or consistency, and in some cases the definition will include all of the above and more.

Depending on the chosen definition, an appropriate evaluation metric should be selected: for realism, the structural similarity index (SSIM) might be appropriate; for fidelity, the simple mean squared error (MSE) might be preferred; and the Fréchet inception distance (FID) can be used to quantify both the realism and the diversity of generated images. Of course, other definitions and metrics exist.
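A minimal sketch of computing two of these metrics with scikit-image (FID is omitted here, as it requires a pretrained Inception network):

```python
# A minimal sketch comparing a real and a synthetic image with SSIM and MSE.
# Both images are assumed to be 2-D arrays normalized to [0, 1].
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

def compare_images(real: np.ndarray, synthetic: np.ndarray) -> dict:
    return {
        "ssim": structural_similarity(real, synthetic, data_range=1.0),  # realism proxy
        "mse": mean_squared_error(real, synthetic),                      # fidelity proxy
    }
```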

In our case, one of the specifications is human expert validation, which means that experts should validate the synthetic images according to their expertise. In addition, several statistical methods and tools exist for validating synthetic images. Another way to validate synthetic image data is to use it as a form of data augmentation and check whether it improves the performance of prediction models otherwise trained only on real data.
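The augmentation-based check might be sketched as follows; the choice of a logistic regression classifier and the data names are illustrative assumptions:

```python
# A minimal sketch of the augmentation-based validation described above:
# compare a classifier trained on real data alone with one trained on
# real + synthetic data, evaluating both on a held-out real test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def augmentation_gain(X_real, y_real, X_syn, y_syn, X_test, y_test) -> float:
    baseline = LogisticRegression(max_iter=1000).fit(X_real, y_real)
    augmented = LogisticRegression(max_iter=1000).fit(
        np.vstack([X_real, X_syn]), np.concatenate([y_real, y_syn])
    )
    # Positive values suggest the synthetic data adds useful information.
    return (accuracy_score(y_test, augmented.predict(X_test))
            - accuracy_score(y_test, baseline.predict(X_test)))
```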

  • Tabular data: In this case, the evaluation and validation process is simpler, as some of the quality measures used for images do not apply to tabular data. For instance, there is no notion of “realism” for tabular data, and human expert validation also does not apply.

Validating synthetic tabular data is usually performed using statistical tools. Another way to evaluate tabular data is to use classifiers pre-trained on real data to make predictions on the synthetic data and then examine the results.
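A minimal sketch combining both approaches, a per-column two-sample Kolmogorov–Smirnov test and a classifier pre-trained on real data; the column and label names are assumptions:

```python
# A minimal sketch of the two tabular checks described above: a per-column
# statistical comparison and predictions from a classifier trained on real data.
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import accuracy_score

def column_similarity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    """p-values per numeric column; low values suggest differing distributions."""
    return pd.Series({
        col: ks_2samp(real[col], synthetic[col]).pvalue
        for col in real.select_dtypes("number").columns
    })

def classifier_check(pretrained_model, synthetic: pd.DataFrame, label_col: str = "label") -> float:
    """Accuracy of a model trained on real data when applied to synthetic rows."""
    X = synthetic.drop(columns=[label_col])
    return accuracy_score(synthetic[label_col], pretrained_model.predict(X))
```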