Semi-blind machine learning - ensemble learning (SML-EL)

The program vsml_statistics implements a new approach for predictive modeling based on fMRI. The main idea is to supplement fMRI data with readily available non-imaging information so that reliable predictive modeling becomes feasible even for smaller sample sizes.

The difference between vsml and vsml_statistics is that vsml_statistics investigates a number of randomly sampled training and test set pairs so that statistics about the accuracy of the predictions can be obtained. The program vsml, on the other hand, applies SML to only one specific training and test set pair.

The input into vsml_statistics is a collection of connectomes of N subjects together with a textfile containing the target variable of interest, e.g. intelligence (one line per subject). A second textfile containing non-imaging supplementary information is also needed (also one line per subject). The number of rows in those two files must equal the number of subjects specified by the parameter ‘-nsubjects’.
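The row-count requirement above can be checked before running the program. The following is a minimal sketch, not part of vsml_statistics; the function names `count_rows` and `check_inputs` are hypothetical, and it assumes that empty lines do not count as subjects.

```python
def count_rows(lines):
    """Count non-empty lines (assumption: one non-empty line per subject)."""
    return sum(1 for line in lines if line.strip())


def check_inputs(target_lines, supplementary_lines, n_subjects):
    """Raise ValueError if either text file does not have one row per subject."""
    n_target = count_rows(target_lines)
    n_supp = count_rows(supplementary_lines)
    if n_target != n_subjects or n_supp != n_subjects:
        raise ValueError(
            f"expected {n_subjects} rows, found {n_target} (target) "
            f"and {n_supp} (supplementary)"
        )
    return True


# Possible usage with the file names from the example in this document
# (the subject count 390 is an illustrative assumption):
#
#     with open("IQ.txt") as f, open("education.txt") as g:
#         check_inputs(f, g, 390)
```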

The program randomly selects K non-overlapping training/test sets, where K is specified by the parameter ‘-numsamples’. The number of elements in each training set is specified by the parameter ‘-ntrain’. Likewise, the number of elements in each test set is specified by the parameter ‘-ntest’. Because training and test sets do not overlap, the sum of these two numbers should not exceed the total number of subjects N. The parameters ‘-dimX’ and ‘-npls’ are used to control the partial least squares regression (PLS), where ‘-dimX’ determines the number of features (edges of the connectome) that are input into the PLS, while ‘-npls’ determines the number of latent components. The parameter ‘-nensembles’ determines the number of ensembles in the ensemble learning process.
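The sampling step can be illustrated with a short sketch. This is not the program's actual implementation; the function `sample_splits` is hypothetical, and it assumes that "non-overlapping" means the training and test set of each pair share no subjects.

```python
import random


def sample_splits(n_subjects, n_train, n_test, num_samples, seed):
    """Draw num_samples random pairs of disjoint training/test index lists."""
    if n_train + n_test > n_subjects:
        raise ValueError("n_train + n_test must not exceed n_subjects")
    rng = random.Random(seed)  # a fixed seed makes the splits reproducible
    splits = []
    for _ in range(num_samples):
        # Sampling without replacement guarantees that no subject
        # appears in both the training and the test set of one pair.
        chosen = rng.sample(range(n_subjects), n_train + n_test)
        splits.append((chosen[:n_train], chosen[n_train:]))
    return splits


# Values taken from the example call in this document; the total of
# 390 subjects is an illustrative assumption.
splits = sample_splits(390, 290, 100, 50, 12345)
```

Each element of `splits` is a pair of index lists that could be used to select the subjects of one training/test set pair.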

The output is a textfile with K rows (i.e. one row for each selection of training/test sets). Each such row shows the resulting linear correlation and R^2 between the predicted and the observed target variable (e.g. IQ). More precisely, each output row has 8 columns headed ‘A’, ‘B’, ‘C’, ‘X’, ‘AX’, ‘BX’, ‘CX’, ‘alpha’. The column headed ‘A’ contains the correlation between predicted and observed IQ without using supplementary information. The column headed ‘B’ contains the correlation between predicted and observed IQ using supplementary information, but without bias control. The column headed ‘C’ contains the correlation between predicted and observed IQ using supplementary information, this time with bias control. The column headed ‘X’ contains the correlation between predicted and observed IQ using only supplementary information (without using fMRI data).
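For readers unfamiliar with the two accuracy measures, the following sketch shows how a linear (Pearson) correlation and an R^2 value can be computed from predicted and observed values. It is an illustration only. The text does not state which definition of R^2 the program uses; the sketch assumes the coefficient of determination, 1 - SS_res/SS_tot. The function names and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Linear (Pearson) correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition: 1 - SS_res/SS_tot)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up values: the prediction is perfectly correlated with the
# observation but has the wrong scale.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(observed, predicted)
r2 = r_squared(observed, predicted)
```

Under this definition the two measures can disagree strongly: a prediction can correlate perfectly with the observation and still have a poor R^2 if its scale or offset is wrong.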

The column headed ‘AX’ contains information about the bias in column ‘A’, i.e. the correlation of the prediction with the supplementary information. This correlation should be approximately the same as the correlation of the observed IQ with the supplementary information. Likewise, the columns headed ‘BX’ and ‘CX’ contain information about the bias in columns ‘B’ and ‘C’, respectively. The column headed ‘alpha’ shows the parameter alpha after adjustment for bias control.
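The K output rows can be summarized with a few lines of code. The sketch below makes several assumptions that the text does not confirm: that the columns are separated by whitespace, that a header line (if present) is not numeric, and that each data row holds exactly the eight values listed above. The function names `parse_results` and `column_means` are hypothetical.

```python
COLUMNS = ["A", "B", "C", "X", "AX", "BX", "CX", "alpha"]


def parse_results(lines):
    """Parse output rows into dictionaries keyed by column name."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip empty or unexpected lines
        try:
            values = [float(v) for v in fields]
        except ValueError:
            continue  # skip a non-numeric header line
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def column_means(rows):
    """Average every column over all training/test set pairs."""
    return {name: sum(r[name] for r in rows) / len(rows) for name in COLUMNS}


# Made-up example rows, not real program output.
example = [
    "A B C X AX BX CX alpha",
    "0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8",
    "0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0",
]
rows = parse_results(example)
means = column_means(rows)
```

Comparing the mean of column ‘A’ with the means of columns ‘B’ and ‘C’ gives a quick impression of how much the supplementary information changes the prediction accuracy across samples.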

Example:

vsml_statistics -in func_*.v -ntrain 290 -ntest 100 -regressor IQ.txt -xx education.txt \
 -numsamples 50 -dimX 800 -npls 10 -nensembles 1000 -seed 12345 -out results.txt

Parameters of ‘vsml_statistics’:

-help

Prints usage information.

-in

Input files

-out

Output textfile

-ntrain

Number of subjects in the training set

-ntest

Number of subjects in the test set

-regressor

Textfile containing the target variable (e.g. IQ, one line per subject)

-xxx

Supplementary information (e.g. education levels, one line per subject)

-dimX

Number of features per ensemble

-npls

Number of components for PLS

-nensembles

Number of ensembles

-numsamples

Number of samples

-seed

Seed for random number generator

Reference:

Lohmann, G. et al. (2023), Improving the reliability of fMRI-based predictions of intelligence via semi-blind machine learning, bioRxiv, https://doi.org/10.1101/2023.11.03.565485