Semi-blind machine learning - ensemble learning (SML-EL)
-----------------------------------------------------------

The program **vsml_statistics** implements a new approach to predictive modeling based on fMRI.
The main idea is to supplement fMRI data with readily available non-imaging information so 
that reliable predictive modeling becomes feasible even for smaller sample sizes.

The difference between *vsml* and *vsml_statistics* is that *vsml_statistics* evaluates a number
of randomly sampled training and test set pairs so that statistics on the accuracy of predictions
can be computed. The program *vsml*, on the other hand, applies SML to only one specific training and
test set pair.

The input into **vsml_statistics** is a collection of connectomes of N subjects together with a
textfile containing the target variable of interest, e.g. intelligence (one line per subject).
A second textfile containing non-imaging supplementary information is also needed (also one line per subject).
The number of rows in those two files must equal the number of subjects specified by the parameter '-nsubjects'.
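
A quick check of the two textfiles can catch row-count mismatches before a long run.
The following is a minimal sketch and not part of the program itself. The helper names
'count_values' and 'check_row_counts' are hypothetical, and the sketch assumes that blank
lines do not count as subject rows.

```python
def count_values(lines):
    """Count non-blank lines (assumed here: one value per subject)."""
    return sum(1 for line in lines if line.strip())


def check_row_counts(n_subjects, regressor_lines, supplementary_lines):
    """Raise ValueError if either input does not have one row per subject."""
    for name, lines in (("regressor", regressor_lines),
                        ("supplementary", supplementary_lines)):
        n_rows = count_values(lines)
        if n_rows != n_subjects:
            raise ValueError(
                f"{name} file has {n_rows} rows, expected {n_subjects}")


# Usage with in-memory lines; with real files, pass open(path) instead.
check_row_counts(3, ["101", "95", "110"], ["12", "16", "10"])
```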

The program randomly selects K non-overlapping training/test sets, where K is specified by the parameter '-numsamples'.
The number of elements in each training set is specified by the parameter '-ntrain'.
Likewise, the number of elements in each test set is specified by the parameter '-ntest'.
Because training and test sets do not overlap, the sum of those two numbers should not exceed the total number of subjects N.
The parameters '-dimX' and '-npls' are used to control the partial least squares regression (PLS),
where '-dimX' determines the number of features (edges of the connectome) that are input into the PLS,
while '-npls' determines the number of latent components.
The parameter '-nensembles' determines the number of ensembles in the ensemble learning process.
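
The sampling of training/test pairs can be sketched as follows. This is an illustration only:
the function 'sample_splits' is hypothetical, the subject count of 400 is made up, and the
sampling scheme and random number generator used by the program may differ, so the same
seed will not reproduce the program's splits.

```python
import random


def sample_splits(n_subjects, n_train, n_test, num_samples, seed):
    """Draw num_samples random (train, test) index pairs without overlap."""
    if n_train + n_test > n_subjects:
        raise ValueError("ntrain + ntest must not exceed the number of subjects")
    rng = random.Random(seed)
    splits = []
    for _ in range(num_samples):
        # Draw all needed subjects at once without replacement, then cut the
        # draw in two, so that train and test can never share a subject.
        chosen = rng.sample(range(n_subjects), n_train + n_test)
        splits.append((chosen[:n_train], chosen[n_train:]))
    return splits


# Hypothetical cohort of 400 subjects; the other values mirror the
# parameters '-ntrain', '-ntest', '-numsamples' and '-seed'.
splits = sample_splits(n_subjects=400, n_train=290, n_test=100,
                       num_samples=50, seed=12345)
train, test = splits[0]
print(len(splits), len(train), len(test), set(train).isdisjoint(test))
# prints: 50 290 100 True
```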


The output is a textfile with K rows (i.e. one row for each selection of training/test sets).
Each such row shows the resulting linear correlation and R^2 between the predicted and the
observed target variable (e.g. IQ). More precisely, each output row has 8 columns headed
'A','B','C','X','AX','BX','CX','alpha'.
The column headed 'A' contains the correlation between predicted and observed IQ without using
supplementary information.
The column headed 'B' contains the correlation between predicted and observed IQ using
supplementary information, but without bias control.
The column headed 'C' contains the correlation between predicted and observed IQ using
supplementary information, this time with bias control.
The column headed 'X' contains the correlation between predicted and observed IQ using only
supplementary information (without using fMRI data).

The column headed 'AX' contains information about the bias in column 'A',
i.e. the correlation of the prediction with the supplementary information. This correlation
should be approximately the same as the correlation of the observed IQ with the supplementary information.
Likewise, the columns headed 'BX' and 'CX' contain information about the bias in columns 'B' and 'C', respectively.
The column headed 'alpha' shows the parameter alpha after adjustment for bias control.
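
To make the accuracy and bias columns concrete, the sketch below computes a Pearson (linear)
correlation for a handful of made-up values. The numbers are purely illustrative and the
function 'pearson' is a hypothetical helper; that the program computes its columns in exactly
this way is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up toy values for five test subjects (illustration only).
observed = [95.0, 100.0, 105.0, 110.0, 120.0]   # observed target, e.g. IQ
predicted = [98.0, 99.0, 104.0, 108.0, 115.0]   # model prediction
education = [10.0, 12.0, 12.0, 16.0, 18.0]      # supplementary information

accuracy = pearson(predicted, observed)    # analogous to column 'A', 'B' or 'C'
bias = pearson(predicted, education)       # analogous to column 'AX', 'BX' or 'CX'
reference = pearson(observed, education)   # value the bias should be close to
print(f"accuracy={accuracy:.3f} bias={bias:.3f} reference={reference:.3f}")
```

A bias value far above the reference would suggest that the prediction leans more heavily on
the supplementary information than the observed target does.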


Example:
^^^^^^^^^^^^

 :: 
 
   vsml_statistics -in func_*.v -ntrain 290 -ntest 100 -regressor IQ.txt -xxx education.txt \
    -numsamples 50 -dimX 800 -npls 10 -nensembles 1000 -seed 12345 -out results.txt
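
The resulting textfile can then be summarized across the K rows, for instance by averaging
each column. The sketch below assumes a whitespace-separated file whose first row holds the
column headers; this layout is an assumption and should be checked against the actual output.
The function 'summarize' is hypothetical and the sample numbers are made up.

```python
import statistics


def summarize(lines):
    """Return the mean of every column, keyed by its header name."""
    rows = [line.split() for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    return {
        name: statistics.mean(float(row[i]) for row in data)
        for i, name in enumerate(header)
    }


# Made-up sample in the assumed layout (two rows, i.e. K = 2).
sample = """A B C X AX BX CX alpha
0.20 0.30 0.25 0.15 0.10 0.40 0.16 0.50
0.30 0.40 0.35 0.25 0.20 0.50 0.24 0.70
"""
means = summarize(sample.splitlines())
for name, value in means.items():
    print(f"{name}: {value:.3f}")

# With a real file: means = summarize(open("results.txt"))
```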


Parameters of 'vsml':
^^^^^^^^^^^^^^^^^^^^^^^

 -help         Prints usage information.
 -in           Input files
 -out          Output textfile
 -ntrain       Number of subjects in the training set
 -ntest        Number of subjects in the test set
 -regressor    Textfile containing the target variable (e.g. IQ, one line per subject)
 -xxx          Supplementary information (e.g. education levels, one line per subject)
 -dimX         Number of features per ensemble
 -npls         Number of components for PLS
 -nensembles   Number of ensembles
 -numsamples   Number of randomly sampled training/test set pairs (K)
 -seed         Seed for random number generator


.. index:: vsml_statistics


Reference:
^^^^^^^^^^^^^^^^^^^^^^

 Lohmann, G. et al. (2023). Improving the reliability of fMRI-based predictions of intelligence via semi-blind machine learning. bioRxiv. https://doi.org/10.1101/2023.11.03.565485