
Reviewing a preprint


After a while, I am writing back to comment on the preprint "Benchmarking joint multi-omics dimensionality reduction approaches for cancer study". First, I'll explain why I am commenting here: on 20th September 2019 I saw on Twitter a poster about multi-omics mentioning RGCCA, which I have been using for a while. After I asked something about it, the first author of the poster said she would notify me when the preprint was ready. Now that the preprint is out, the first author of the paper, Laura, has asked for comments on it.


Abstract

Already from the abstract one notices that there is a special emphasis on reproducible research: the methods used can be easily accessed and reused by other readers.

Background

The integration methods are classified after explaining why integration methods are needed and why we need to compare them. One of these classes is the dimensionality reduction approaches, which this article focuses on. However, I do not agree with the following sentence: "Dimensionality Reduction (DR) approaches decompose the omics into a shared low-dimensional latent space". Neither of the two references provided supports this claim, as far as I understood them. Some dimensionality reduction methods decompose the omics into low-dimensional (latent?) spaces that are not shared by all the omics.


Results

1) Joint Dimensionality Reduction approaches and principles

The methods analyzed are RGCCA, MCIA, MOFA, MSFA, intNMF, iCluster, JIVE, tICA and scikit-fusion. Most of them are implemented in R packages, except scikit-fusion, which is a Python package. Some of these methods have several implementations; JIVE, for instance, is also available in the STATegRa package.


The article describes the factor matrix as shared between all omics, but the RGCCA and MCIA methods used don't have a factor matrix (F) shared by all the omics; they have one factor matrix per omic (Fi) instead. This is acknowledged in the results rather than in the background section. The article also explains that JIVE and MSFA have some factors that are shared by all the omics and some that are not.

It is also mentioned that some methods require the same samples across omics, some need the same features, and others need both. A possible workaround mentioned is to convert all of the omics to the same feature symbols. I think there is some confusion here between what a feature is and how it is represented. The methodology used is to work with the correlation matrix of the samples.
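As I understand it, the point of using the correlation matrix of the samples is that it yields a samples-by-samples representation whatever features each omic measures. A minimal R sketch of the idea, with made-up matrices (the variable names are mine, not the paper's):

    # Each omic is a features x samples matrix over the same 20 samples;
    # the feature sets differ, but cor() on the columns gives a
    # 20 x 20 samples-by-samples matrix for each omic.
    expr  <- matrix(rnorm(100 * 20), nrow = 100)  # 100 genes
    methy <- matrix(rnorm(300 * 20), nrow = 300)  # 300 CpG sites

    cor_expr  <- cor(expr)   # 20 x 20, independent of the number of genes
    cor_methy <- cor(methy)  # 20 x 20, independent of the number of CpGs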

I think it would be clearer if this first section of the results were in either the background or the methods section.

2) Benchmarking joint Dimensionality Reduction approaches on simulated omics datasets

There is now a comparison of the methods using the Jaccard index. Some of the "best-performing method[s]" are pointed out on 6 different simulated datasets, with 5, 10 or 15 groups, of equal size or not. I think it would be clearer to describe that in the text instead of leaving the reader to work it out from the caption of Figure 2. Also, it might be worth faceting the plots; some methods are in a different plot inside the figure, which makes it harder to read.
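For reference, a common way to compute a Jaccard index between a recovered clustering and a ground-truth clustering is over pairs of samples; I am not sure this is the exact variant the paper uses, but a minimal R sketch would be:

    # Pairwise Jaccard index between two clusterings: pairs grouped together
    # in both clusterings / pairs grouped together in at least one.
    jaccard_clusterings <- function(c1, c2) {
      same1 <- outer(c1, c1, "==")
      same2 <- outer(c2, c2, "==")
      up <- upper.tri(same1)
      sum(same1[up] & same2[up]) / sum(same1[up] | same2[up])
    }

    jaccard_clusterings(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))  # 1/3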

3) Benchmarking joint Dimensionality Reduction approaches on cancer datasets

Now the same methods are applied to the TCGA dataset, for each of the cancer types, and the resulting weights are tested for enrichment in known biological pathways and processes.
Next they test whether the factors (all 10 of them?) are related to survival. But the result depends more on the cancer type than on the method used.

Clinical annotations are used for the integration: “age of patients,” “days to new tumor,” “gender”, and “neo-adjuvant therapy administration”. The reason is to compare the selectivity and specificity of the methods.


Biological pathways and processes are analyzed using Reactome, the Gene Ontology and the cancer hallmarks from GSEA.

From this section it is not clear to me what the expectations were before doing the tests. What I understand is that the relationship between the omics should be related to these variables.

From the enrichment tests, the expectation seems to be that the selected genes should be related both to the relationship between the data used (gene expression, DNA methylation and miRNA expression) and to a known, annotated function. While this might be true, it is known that genes are not well annotated and that the predominant annotation comes from differential expression analysis. This makes it hard to expect that the genes important for these relations would be annotated.

Regarding Figure 3 (referenced in this section), I would modify the y axis. It seems that logarithms in base 10 are not used, because there are some values below 0. Changing this would make the plot easier to understand.
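(As the authors clarify in the comments below, the axis is in fact the base-10 logarithm of Bonferroni-corrected p-values, and a Bonferroni correction applied by plain multiplication, unlike R's p.adjust(), which caps at 1, can push a p-value above 1 and hence below 0 on a -log10 scale. A quick R check:)

    p <- 0.2                # raw p-value
    p_bonf <- p * 10        # plain-multiplication Bonferroni over 10 tests = 2
    -log10(p_bonf)          # -0.301..., i.e. below 0 on a -log10 axis
    p.adjust(p, method = "bonferroni", n = 10)  # capped at 1 instead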

4) Benchmarking joint Dimensionality Reduction approaches on single-cell datasets

Now the data used are scRNA-seq and scATAC-seq on three cancer cell lines (HTC, HeLa and K562), for a total of 206 cells. The first two factors are used to decide whether the methods perform well or not. It is not described which of the factor matrices is selected when using a method that has one factor matrix per omic. Also, the clusters are expected to be on the first two factors, rather than on, say, just the first one or the fourth.

5) Multi-omics mix (momix) Jupyter notebook

The authors provide the code for reproducing the content of the paper, which is quite nicely documented and easy to understand.

Discussion

The article mentions that some of this Big Data™ "are frequently profiled from different sets of patients/samples, leading to missing data", which seems to suggest that these methods need to be extended to handle missing data. My point of view is that sometimes we don't need bigger data but better data. Attempting to find relationships between several patients at different time points, with different data from each patient, won't work as well as having smaller data but covering all the patients at all the time points.

The article also draws the distinction between co-inertia and correlation.

The suggestion to invest in capturing non-linear signals in the data is important, and I really hope more approaches appear that handle multiple omics from the same patients and are capable of identifying these signals. One approach could come from kernel PCA or similar methods, but so far I haven't found any.
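For a single omic, kernel PCA is already available in R through the kernlab package; a minimal sketch of the kind of non-linear embedding I mean (the extension to several omics jointly is the missing part):

    library(kernlab)  # provides kpca()

    # One omic: a samples x features matrix (made up here).
    omic <- matrix(rnorm(50 * 20), nrow = 50)

    # Kernel PCA with an RBF kernel; rotated() returns the projected samples.
    kp <- kpca(omic, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)
    head(rotated(kp))  # first two non-linear "factors" per sample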

Methods

Presentation of the nine jDR algorithms

A nice, short, easy-to-read summary of the methods. I think the definition of the symbol k is missing; it seems to be the number of factors computed.

1. Integrative Non-negative Matrix Factorization (intNMF)
2. Joint and Individual Variation Explained (JIVE)
3. Multiple co-inertia analysis (MCIA)
I must say that the original paper is a bit obscure to me. But seeing the definition here makes it much clearer that it is the same as RGCCA with tau = 1 and scale = TRUE and the factorial scheme, applied to the result of a dimensionality reduction (like a PCA) instead of to the data directly (see the sketch after this list).
4. Regularized Generalized Canonical Correlation Analysis (RGCCA)
While RGCCA is well described, the sparse variant for omic data (SGCCA) is not used in the paper.
5. iCluster
6. Multi-Omics Factor Analysis (MOFA)
7. Tensorial Independent Component Analysis (tICA)
8. Multi-Study Factor Analysis (MSFA)
9. Data fusion (scikit-fusion)
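To make the MCIA note above concrete: in the RGCCA R package, the setting I have in mind would look roughly as below. This is a minimal sketch based on my reading of the methods section, not the paper's actual code; the toy blocks and the PCA preprocessing are made up for illustration:

    library(RGCCA)

    # Two toy omics blocks over the same 20 samples (samples x features).
    blocks <- list(
      expr  = matrix(rnorm(20 * 100), nrow = 20),
      methy = matrix(rnorm(20 * 300), nrow = 20)
    )

    # MCIA-like setting: tau = 1, factorial scheme, scaled blocks,
    # run on a first dimensionality reduction (PCA) of each omic.
    blocks_pca <- lapply(blocks, function(x) prcomp(x, scale. = TRUE)$x[, 1:10])
    fit <- rgcca(blocks_pca, tau = c(1, 1), ncomp = c(2, 2),
                 scheme = "factorial", scale = TRUE)

The sparse variant mentioned under point 4 is SGCCA (available in the same package), which the paper does not use.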

Factor selection for performance comparisons

This section explains the process followed with the omics-specific factors. However, these differences are not discussed or used later in the article. Part of this could go in the discussion section.

Data simulation

The number of simulations seems low, just 1000. Also, it is not clear to me whether the same dataset is used for all the methods or each method was run on different simulated datasets. Looking at the source code provided confirms that each method is analyzed on its own simulated data. Thus the comparison is not fair; I think it would be better to run all the methods on the same simulated datasets. (Aside: it would be relevant to store the seed used to generate those random datasets if you want others to create those exact same datasets.)
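On the seed point, what I mean is simply recording set.seed() alongside the simulation code, so that anyone can regenerate exactly the same datasets. A minimal sketch (the simulation itself is a stand-in):

    set.seed(20190920)                          # seed stored with the code
    sim1 <- matrix(rnorm(100 * 50), nrow = 100) # stand-in for the simulation

    set.seed(20190920)                          # same seed ...
    sim2 <- matrix(rnorm(100 * 50), nrow = 100) # ... exact same dataset
    identical(sim1, sim2)                       # TRUE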

Clustering of factor matrix
Comparing jDR algorithm clusters to ground-truth clusters


Selection of the clinical annotations

The clinical annotations are relevant. Having some technical annotations as well (like vial_number or patient_id) would be useful to check whether there is patient-level variation or some difference between vials processed earlier or later. This way some batch effects could be ruled out.

Selectivity score

It reminds me of the Dice or Sørensen score; maybe a more common index could be used. If the Dice or the Jaccard index were used, it would be comparable to the "ground truth" from the simulated datasets.
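For comparison, the Dice (Sørensen) coefficient between two sets A and B is 2|A∩B| / (|A| + |B|); a one-liner in R with made-up gene sets:

    dice <- function(a, b) 2 * length(intersect(a, b)) / (length(a) + length(b))

    dice(c("TP53", "KRAS", "MYC"), c("TP53", "MYC", "EGFR"))  # 2*2 / (3+3) = 0.67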

Testing the biological enrichment of metagenes

Here fgsea is used, which is a package I like and have contributed to in the past. With the metagenes (rows of the weight matrix), most of the genes should have a weight of 0, which makes the GSEA method unstable.
But it seems that it is also used for gene ontologies, which have an underlying structure (a DAG) that affects how the terms are identified as significant. Also, it is not clear whether they used just some sub-ontologies, like biological process or molecular function, or the whole ontology.
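A minimal sketch of the kind of call I mean, with a made-up metagene (mostly zero weights) and made-up gene sets; fgsea will warn about the ties produced by all those zeros:

    library(fgsea)

    # Hypothetical metagene: weights for 1000 genes, most of them exactly 0.
    metagene <- setNames(numeric(1000), paste0("gene", 1:1000))
    metagene[1:50] <- rnorm(50)

    # Hypothetical gene sets.
    pathways <- list(setA = paste0("gene", 1:25),
                     setB = paste0("gene", 500:540))

    res <- fgsea(pathways = pathways, stats = metagene,
                 minSize = 5, maxSize = 500)
    # The many tied zero weights trigger a ties warning and make the
    # resulting ranking, and hence the enrichment, unstable.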

Quality of single-cell clusters


Summary


Overall, the article is a nice comparison of several methods and provides a framework to compare more multi-omics methods. I think that to better compare the methods it should be done with non-default parameters and between similar methods. Expecting the omic-specific methods to perform as well as the joint-factor methods is a bit of a stretch; explaining this in the introduction could spare the reader a surprise when reading the methods section. The brief summary of the methods is very useful and well described.
Most of these methods have a score indicating how well they performed on the integration; I'm not sure whether this could be used to compare between the methods.

This is my first review of an article, so if you'd like to suggest some improvements on the comments let me know.

Comments

  1. Thanks for all your feedback; it is truly appreciated. We provide below point-by-point responses to your different comments:

    Abstract
    Already from the abstract one notices that there is a special emphasis on reproducible research: the methods used can be easily accessed and reused by other readers.
    Reply: Yes! Indeed we are particularly interested in reproducibility because we expect many more methods and datasets to be published in the upcoming years.

    Background
    The integration methods are classified after explaining why integration methods are needed and why we need to compare them. One of these classes is the dimensionality reduction approaches, which this article focuses on. However, I do not agree with the following sentence: "Dimensionality Reduction (DR) approaches decompose the omics into a shared low-dimensional latent space". Neither of the two references provided supports this claim, as far as I understood them. Some dimensionality reduction methods decompose the omics into low-dimensional (latent?) spaces that are not shared by all the omics.
    Reply: This is correct. We wanted to summarize at the beginning of the manuscript the general behavior of jDR methods. The details about shared, mixed and omics-specific behaviors are detailed afterward in the manuscript.

    Results

    1) Joint Dimensionality Reduction approaches and principles
    The methods analyzed are RGCCA, MCIA, MOFA, MSFA, intNMF, iCluster, JIVE, tICA and scikit-fusion. Most of them are implemented in R packages, except scikit-fusion, which is a Python package. Some of these methods have several implementations; JIVE, for instance, is also available in the STATegRa package. The article describes the factor matrix as shared between all omics, but the RGCCA and MCIA methods used don't have a factor matrix (F) shared by all the omics; they have one factor matrix per omic (Fi) instead. This is acknowledged in the results rather than in the background section. The article also explains that JIVE and MSFA have some factors that are shared by all the omics and some that are not. It is also mentioned that some methods require the same samples across omics, some need the same features, and others need both. A possible workaround mentioned is to convert all of the omics to the same feature symbols. I think there is some confusion here between what a feature is and how it is represented. The methodology used is to work with the correlation matrix of the samples. I think it would be clearer if this first section of the results were in either the background or the methods section.
    Reply: We agree that for a classical paper, such a section could be in the methods. However, as we present here a benchmark, the general description and choice of the methods is already a result. This organization is similar to other benchmark papers.

    2) Benchmarking joint Dimensionality Reduction approaches on simulated omics datasets
    There is now a comparison of the methods using the Jaccard index. Some of the "best-performing method[s]" are pointed out on 6 different simulated datasets, with 5, 10 or 15 groups, of equal size or not. I think it would be clearer to describe that in the text instead of leaving the reader to work it out from the caption of Figure 2. Also, it might be worth faceting the plots; some methods are in a different plot inside the figure, which makes it harder to read.
    Reply: We separated each plot into two in order to highlight the methods having an implemented internal clustering and those that do not. Indeed, it's not completely fair to compare them directly as we had to implement a different protocol for the non-clustering approaches.

  2. 3) Benchmarking joint Dimensionality Reduction approaches on cancer datasets
    Now the same methods are applied to the TCGA dataset, for each of the cancer types, and the resulting weights are tested for enrichment in known biological pathways and processes. Next they test whether the factors (all 10 of them?) are related to survival. But the result depends more on the cancer type than on the method used. Clinical annotations are used for the integration: “age of patients,” “days to new tumor,” “gender”, and “neo-adjuvant therapy administration”. The reason is to compare the selectivity and specificity of the methods. Biological pathways and processes are analyzed using Reactome, the Gene Ontology and the cancer hallmarks from GSEA. From this section it is not clear to me what the expectations were before doing the tests. What I understand is that the relationship between the omics should be related to these variables.
    From the enrichment tests, the expectation seems to be that the selected genes should be related both to the relationship between the data used (gene expression, DNA methylation and miRNA expression) and to a known, annotated function. While this might be true, it is known that genes are not well annotated and that the predominant annotation comes from differential expression analysis. This makes it hard to expect that the genes important for these relations would be annotated.

    Reply: At the beginning, we were expecting this test to find the jDR methods that maximize the number of clinical annotations and pathways, GOs and Hallmarks associated with their factors. But we then realized that, at the same time, we do not want to have the same factor associated with multiple different annotations. If this happens, the factor is « mixed » and it could not disentangle the activity of a single process in a new dataset.


    Regarding Figure 3 (referenced in this section), I would modify the y axis. It seems that logarithms in base 10 are not used, because there are some values below 0. Changing this would make the plot easier to understand.
    Reply: The y-axis is indeed the logarithm in base 10 of the P-values. Due to the Bonferroni multiple testing correction, some values can become bigger than 1, leading to negative values.

    4) Benchmarking joint Dimensionality Reduction approaches on single-cell datasets
    Now the data used are scRNA-seq and scATAC-seq on three cancer cell lines (HTC, HeLa and K562), for a total of 206 cells. The first two factors are used to decide whether the methods perform well or not. It is not described which of the factor matrices is selected when using a method that has one factor matrix per omic. Also, the clusters are expected to be on the first two factors, rather than on, say, just the first one or the fourth.
    Reply: Whenever we deal with RGCCA and MCIA, we use only the factor matrix associated with the transcriptomic data. We explain this point in the methods section « Factor selection for performance comparisons ». The use of the first 2 factors comes from the fact that there is a hierarchy among the factors (as in PCA), and the first two should be the most relevant ones.

    5) Multi-omics mix (momix) Jupyter notebook
    The authors provide the code for reproducing the content of the paper, which is quite nicely documented and easy to understand.

    Discussion
    The article mentions that some of this Big Data™ "are frequently profiled from different sets of patients/samples, leading to missing data", which seems to suggest that these methods need to be extended to handle missing data. My point of view is that sometimes we don't need bigger data but better data. Attempting to find relationships between several patients at different time points, with different data from each patient, won't work as well as having smaller data but covering all the patients at all the time points.

  3. Methods
    Presentation of the nine jDR algorithms
    A nice, short, easy-to-read summary of the methods. I think the definition of the symbol k is missing; it seems to be the number of factors computed.
    Reply: k is indeed the number of factors, we now have clarified it in the methods.
    3. Multiple co-inertia analysis (MCIA)
    I must say that the original paper is a bit obscure to me. But seeing the definition here makes it much clearer that it is the same as RGCCA with tau = 1 and scale = TRUE and the factorial scheme, applied to the result of a dimensionality reduction (like a PCA) instead of to the data directly.
    4. Regularized Generalized Canonical Correlation Analysis (RGCCA)
    While RGCCA is well described, the sparse variant for omic data (SGCCA) is not used in the paper.
    Reply: we did not compare the performances of the jDR algorithms based on their feature-selection ability. Through the notebook it is anyway possible to extend the comparison to SGCCA in case of interest.

    Factor selection for performance comparisons
    This section explains the process followed with the omics-specific factors. However, these differences are not discussed or used later in the article. Part of this could go in the discussion section.
    Reply: we mention it in the Discussion when talking about suggestions to developers.

    Data simulation
    The number of simulations seems low, just 1000. Also, it is not clear to me whether the same dataset is used for all the methods or each method was run on different simulated datasets. Looking at the source code provided confirms that each method is analyzed on its own simulated data. Thus the comparison is not fair; I think it would be better to run all the methods on the same simulated datasets. (Aside: it would be relevant to store the seed used to generate those random datasets if you want others to create those exact same datasets.)
    Reply: We always run all the jDR methods on the same datasets. This is also what we perform in momix.

    Clustering of factor matrix
    Comparing jDR algorithm clusters to ground-truth clusters
    Selection of the clinical annotations
    The clinical annotations are relevant. Having some technical annotations as well (like vial_number or patient_id) would be useful to check whether there is patient-level variation or some difference between vials processed earlier or later. This way some batch effects could be ruled out.
    Reply: we selected all the clinical annotations available for more than 9 cancer types. We do agree that many other clinical annotations would be relevant, but for the sake of comparison, we decided here not to use them.

    Selectivity score
    It reminds me of the Dice or Sørensen score; maybe a more common index could be used. If the Dice or the Jaccard index were used, it would be comparable to the "ground truth" from the simulated datasets.
    Reply: The selectivity score we propose intends to measure how frequently one-to-one associations between annotations and factors can be identified. We checked the score that you propose, but it looks like an alternative to the Jaccard index, which is used to compare intersections between 2 sets (ground truth vs inferred clustering). We did not find pre-existing indices addressing the problem of testing a one-to-one association, but we might have missed relevant literature.

    Testing the biological enrichment of metagenes
    Here fgsea is used, which is a package I like and have contributed to in the past. With the metagenes (rows of the weight matrix), most of the genes should have a weight of 0, which makes the GSEA method unstable.
    But it seems that it is also used for gene ontologies, which have an underlying structure (a DAG) that affects how the terms are identified as significant. Also, it is not clear whether they used just some sub-ontologies, like biological process or molecular function, or the whole ontology.
    Reply: For GO we used the signatures present in MsigDB, which selects only the lower level of the GO hierarchy.

