After a while I am writing again, this time to comment on the preprint "Benchmarking joint multi-omics dimensionality reduction approaches for cancer study". First, I'll explain why I comment here: on 20th September 2019 I saw on Twitter a poster about multi-omics mentioning RGCCA, which I have been using for a while. After I asked something about it, the first author of the poster replied that they would notify me when the preprint was ready. Now that the preprint is ready, the first author of the paper, Laura, asked for comments on it.
Already from the abstract one notices that there is a special emphasis on reproducible research. The methods used, and how other readers can use them, can be easily accessed and reused.
The integration methods are classified after an explanation of why they are needed and why we need to compare them. One of these classes is the dimension reduction approaches, which this article focuses on. However, I do not agree with the following sentence: "Dimensionality Reduction (DR) approaches decompose the omics into a shared low-dimensional latent space". Neither of the two references provided supports this claim, as far as I understood them. Some dimension reduction methods decompose the omics into low-dimensional (latent?) spaces that are not shared by all the omics.
1) Joint Dimensionality Reduction approaches and principles
The methods analyzed are RGCCA, MCIA, MOFA, MSFA, intNMF, iCluster, JIVE, tICA and scikit-fusion. Most of them are implemented in R packages, except scikit-fusion, which is a Python package. There are several implementations of some of these methods, like JIVE, which is also available in STATegRa.
The article mentions the factor matrix as shared between all omics, but the RGCCA and MCIA methods used don't have a shared factor matrix (F) for all the omics; they have one factor matrix for each omic (Fi) instead. This is acknowledged as a result rather than in the background section. The article also explains that JIVE and MSFA have some factors that are shared by all the omics and some that are not.
It is also mentioned that some methods require the same samples and some need the same features, while others need both. A possible workaround mentioned is to convert all of the omics to the same symbols as the others. I think that here there is a confusion between what a feature is and how it is represented. The methodology used is to use the correlation matrix of the samples.
I think it would be clearer if this first section of the results were in either the background or the methods section.
2) Benchmarking joint Dimensionality Reduction approaches on simulated omics datasets
The methods are now compared using the Jaccard index. Some of the "best-performing method[s]" are mentioned for 6 different simulated datasets, with 5, 10 or 15 groups and with equal sizes or not. I think it would be clearer to describe that in the text instead of letting the reader find it in the caption of Figure 2. It might also be worth faceting the plots; some methods are in a different plot inside the figure, which makes it harder to read.
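To make the comparison concrete, here is a minimal Python sketch of how a Jaccard index between ground-truth groups and predicted clusters could be computed. The function names, the toy sample ids and the "best match per ground-truth group" step are my own assumptions for illustration; I have not checked that the preprint pairs the clusters this way.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of sample ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def best_match_scores(truth, predicted):
    """For each ground-truth group, the highest Jaccard index over all predicted clusters.

    Assumption: each group is scored against its best-matching cluster.
    """
    return {
        name: max(jaccard_index(members, cluster) for cluster in predicted.values())
        for name, members in truth.items()
    }


# Toy example with made-up sample ids.
truth = {"g1": ["s1", "s2", "s3"], "g2": ["s4", "s5"]}
predicted = {"c1": ["s1", "s2"], "c2": ["s3", "s4", "s5"]}
scores = best_match_scores(truth, predicted)
print(scores)
```

A value of 1 means a predicted cluster reproduces the group exactly, and 0 means no predicted cluster shares a sample with it.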
3) Benchmarking joint Dimensionality Reduction approaches on cancer datasets
Now the same methods are applied to each of the cancer types in the TCGA dataset, and the resulting weights are tested for enrichment in known biological pathways and processes.
Next they test whether the factors (all 10 of them?) are related to survival. But the result depends more on the cancer type than on the method used.
Clinical annotations are used for the integration: “age of patients,” “days to new tumor,” “gender”, and “neo-adjuvant therapy somministration”. The reason is to compare the selectivity and specificity of the method.
Biological pathways and processes are analyzed using Reactome, Gene Ontology and the cancer hallmarks from GSEA.
From this section it is not clear to me what the expectations were before doing the tests. What I understood is that the relationship between the omics should be related to these variables.
From the enrichment tests, the expectation seems to be that the selected genes are related to the relationship between the data used (gene expression, DNA methylation, and miRNA expression) and to a known and annotated function. While this might be true, it is known that genes are not well annotated and that the predominant annotation comes from differential expression analysis. This makes it difficult to expect that the genes important for these relationships would be annotated.
In Figure 3 (referenced in this section), I would modify the y axis: it seems that base-10 logarithms are not used, because there are some values below 0. Modifying it would make the plot easier to understand.
4) Benchmarking joint Dimensionality Reduction approaches on single-cell datasets
Now the data used is scRNA-seq and scATAC-seq on three cancer cell lines (HTC, HeLa and K562), for a total of 206 cells. The first two factors are used to decide whether the methods perform well or not. It is not described which of the factors from the omics is selected when using a method that has a factor matrix for each omic. Also, the clusters are expected to be on the first two factors, instead of on just the first one or the fourth.
5) Multi-omics mix (momix) Jupyter notebook
The authors provide the code for reproducing the content of the paper, which is quite nicely documented and easy to understand.
The article mentions that some of this Big Data™ "are frequently profiled from different sets of patients/samples, leading to missing data", which seems to suggest that these methods need to expand to handle missing data. My point of view is that sometimes we don't need bigger data but better data. Attempting to find the relationship between several patients at different time points, with different data from each patient, won't work as well as having a smaller dataset that covers all the patients at all the time points.
The distinction between co-inertia and correlation.
The suggestion to invest in capturing non-linear signals in the data is important, and I really hope there will be more approaches that handle multiple omics from the same patients and are capable of identifying these signals. One approach could come from kernel PCA or similar methods, but so far I haven't found any.
Presentation of the nine jDR algorithms
A nice, short summary of the methods that is easy to read and understand. I think the definition of the symbol k is missing; it seems to be the number of factors calculated.
1. Integrative Non-negative Matrix Factorization (intNMF)
2. Joint and Individual Variation Explained (JIVE)
3. Multiple co-inertia analysis (MCIA)
I must say that the original paper is a bit obscure to me. But seeing the definition here, it is much clearer that it is the same as RGCCA with tau = 1, scale = TRUE and the factorial scheme, applied to the result of a dimensional reduction (like a PCA) instead of to the data directly.
4. Regularized Generalized Canonical Correlation Analysis (RGCCA)
While it is well described, the sparse variant for omics data is not used in the paper.
6. Multi-Omics Factor Analysis (MOFA)
7. Tensorial Independent Component Analysis (tICA)
8. Multi-Study Factor Analysis (MSFA)
9. Data fusion (scikit-fusion)
Factor selection for performance comparisons
This section explains what process was followed with the omics-specific factors. However, these differences are not discussed or used later in the article. Part of it could be in the discussion section.
The number of simulations seems low, just 1000. Also, it is not clear to me whether the same dataset is used for all the methods or whether each method was used with different simulated datasets. Looking at the source code provided confirms that each method is analyzed on its own simulated data. Thus, the comparison is not fair; I think it would be better to run all the methods on the same simulated datasets. (As an aside, it would be relevant to store the seed used to generate those random datasets if you want others to be able to create those exact same datasets.)
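As a rough illustration of what I mean, this is a minimal Python sketch in which every simulated dataset is generated once from a recorded seed and then passed to all the methods. Everything here is hypothetical: `simulate_dataset`, `method_a` and `method_b` are toy stand-ins, not the simulation or the jDR methods from the paper, whose own code I am not reproducing.

```python
import random


def simulate_dataset(seed, n_samples=6, n_features=4):
    """Toy simulation: a samples x features matrix of Gaussian noise.

    The seed fully determines the output, so storing it is enough to
    recreate the exact same dataset later.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]


def method_a(data):
    """Hypothetical stand-in for one method: column means."""
    return [sum(col) / len(col) for col in zip(*data)]


def method_b(data):
    """Hypothetical stand-in for another method: column maxima."""
    return [max(col) for col in zip(*data)]


seeds = [0, 1, 2]  # store these alongside the results
results = []
for seed in seeds:
    data = simulate_dataset(seed)  # generated once per seed...
    results.append({
        "seed": seed,
        "method_a": method_a(data),  # ...and shared by every method
        "method_b": method_b(data),
    })
```

With this layout, the differences between methods within one entry of `results` cannot be attributed to differences between simulated datasets.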
Clustering of factor matrix
Comparing jDR algorithm clusters to ground-truth clusters
Selection of the clinical annotations
The clinical annotations are relevant. Having some other technical annotations (like vial_number, or patient_id) would be relevant to check if there is patient variation or some difference between vials processed earlier or later. This way some batch effect could be discarded.
It reminds me of the Dice or Sørensen score; maybe a more common index could be used. If the Dice or the Jaccard index were used, then it would be comparable to the "ground-truth" comparison from the simulated datasets.
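For reference, the Dice (Sørensen) and Jaccard indices are tightly linked: for two sets, D = 2J / (1 + J), so they always rank pairs of sets in the same order. The short Python sketch below checks this on made-up sets; the function names are mine and are not taken from the paper or its notebook.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Dice (Sørensen) index 2|A ∩ B| / (|A| + |B|); 1.0 for two empty sets."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def dice_from_jaccard(j):
    """Convert a Jaccard index into the corresponding Dice index."""
    return 2 * j / (1 + j)


# Made-up example sets.
x = {1, 2, 3}
y = {2, 3, 4}
j = jaccard(x, y)
d = dice(x, y)
print(j, d, dice_from_jaccard(j))
```

Because the conversion is monotonic, reporting either index would give the same ordering of methods.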
Testing the biological enrichment of metagenes
Here fgsea is used, which is a package I like and have contributed to in the past. With the metagenes (rows of the weight matrix), most of the genes should have a weight of 0, which makes the GSEA method unstable.
But it seems that it is also used for gene ontologies, which have an underlying structure (a DAG) that affects how the terms are identified as significant. Also, it is not clear whether they used just some sub-ontologies, like biological process or molecular function, or the whole ontology.
Quality of single-cell clusters
Overall, the article is a nice comparison of several methods and provides a framework to compare more multi-omics methods. I think that, to better compare the methods, it should be done with non-default parameters and with similar methods. Expecting omic-specific methods to perform as well as the joint-factor methods is a bit of a stretch; explaining this in the introduction could spare the reader a surprise when reading the methods section. The brief summary of the methods is very useful and well described.
Most of these methods have a score indicating how well they performed on the integration; I'm not sure if this could be used to compare between the methods.
This is my first review of an article, so if you'd like to suggest some improvements on the comments let me know.