Fig. 2

Overview of the machine learning workflow used in the project, highlighting key steps. Step 1 removes invalid data, outliers, and features with excessive missing values. Step 2 imputes missing values using MICE or minimum imputation. Step 3 applies ComBat to remove the batch effect. Step 4 partitions the data into five folds before the feature selection phase of step 5. This ordering reduces the common risk of data leakage [36], as performing the k-fold partitioning after feature selection would result in testing on previously seen data points. Step 6 augments the data with synthetic samples generated through SMOTE. Steps 7, 8, and 9 optimize hyperparameters, train, and evaluate the models. Steps 10 and 11 involve evaluating the results and the potential biomarkers suggested by the models.
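
The ordering of steps 4 and 5 can be illustrated with a minimal sketch. It is not the project's implementation: the toy data, the helper names `k_fold_indices` and `select_features`, and the class-mean-difference ranking are all assumptions chosen for illustration. The point it shows is that the features are ranked using only the training rows of each fold, so the held-out rows never influence the selection.

```python
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Split sample indices into k disjoint test folds (hypothetical helper)."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the current test fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def select_features(X, y, rows, n_keep):
    """Rank features by absolute difference in class means, using only `rows`.

    This simple score is a stand-in for whatever selection method is used.
    """
    scores = []
    for f in range(len(X[0])):
        pos = [X[r][f] for r in rows if y[r] == 1]
        neg = [X[r][f] for r in rows if y[r] == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data (assumed): 20 samples, 3 features; only feature 0 tracks the label.
y = [i % 2 for i in range(20)]
X = [[3.0 * y[i] + 0.1 * (i % 3), 0.1 * (i % 5), 0.1 * (i % 7)] for i in range(20)]

# Step 4 first: partition into five folds.
splits = k_fold_indices(len(X), k=5)

# Step 5 second: select features inside each fold, on training rows only.
selected_per_fold = [select_features(X, y, train_idx, n_keep=1)
                     for train_idx, _ in splits]
```

Selecting features on the full data set before partitioning would let every test row contribute to the ranking, which is the leakage the caption describes.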