Functional enrichment methods and pathways

For some time I have been working on one topic. I am not sure this is exactly how it started, but I can't see any other reason. So I'll describe why I'm now working with collections of gene sets.

The trigger

I usually try to help others on Biostars, Bioconductor, and the StackExchange network (especially in Bioinformatics). On one of these sites I was trying to help someone, and one of the comments (Jun 21 '17) said:

You don't build pathway maps from bioinformatics data, you build them from wet-lab experiments.

And I thought: "Why not? We already know (kind of) the number of genes, and we have an idea of the number of metabolites in a cell. We have a lot of data, so why can't we build pathways?" I did a brief literature search and couldn't find anything (if there is something, let me know in the comments).

The background

Let me explain why this comment puzzled me. In my work I am usually asked: what is the relevance of these lists of genes? What are these genes doing? My first reaction is to use one of those wonderful functional enrichment tools. Most (all?) of these tools use one of three methods:

over representation analysis (ORA),
functional class scoring (FCS), and
pathway topology (PT)

Wonderfully explained and compared here, and summarized in this figure:

Functional pathways analysis methods. From: Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges, by Purvesh Khatri, Marina Sirota, and Atul J. Butte.

I lean towards using ORA methods with gene ontologies, through topGO (see this other post), and with pathways from Reactome, through clusterProfiler or ReactomePA. I also use fgsea with the pathways from Reactome or with the gene sets from the Broad Institute (NOT with gene ontologies, because the ontology terms are related to each other).
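To make the ORA idea concrete, here is a minimal sketch in plain Python of the hypergeometric test that ORA tools are generally built around. It is not the code of topGO or clusterProfiler; the function names (ora_p_value, ora) and the toy gene identifiers are made up for illustration, and a real analysis would also correct for multiple testing.

```python
from math import comb


def ora_p_value(universe_size, set_size, list_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of genes from a gene set of size `set_size` found in a
    list of `list_size` genes drawn at random from `universe_size` genes.
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def ora(gene_list, gene_set, universe):
    """ORA p-value for one gene list against one gene set (hypothetical helper)."""
    universe = set(universe)
    # Genes outside the universe cannot be tested, so drop them first.
    gene_list = set(gene_list) & universe
    gene_set = set(gene_set) & universe
    overlap = len(gene_list & gene_set)
    return ora_p_value(len(universe), len(gene_set), len(gene_list), overlap)


# Toy data: 20 measured genes, a 5-gene pathway, a 4-gene list of interest.
universe = {f"g{i}" for i in range(1, 21)}
pathway = {"g1", "g2", "g3", "g4", "g5"}
my_genes = {"g1", "g2", "g3", "g10"}

print(ora(my_genes, pathway, universe))
```

Note that the only thing this test takes from the database is the membership of the gene set, which is why the quality of those lists matters so much.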

As you can see, all these methods rely on predefined lists of genes associated with a category. But some time before that comment, I found that the data I (and, I would say, many other packages and users) relied upon was incorrect.

The adventure

Then I started to think about how these pathway databases work. There are many, with some differences between them. "Which is the most reliable? Which one should I use? Which is the most up to date?"

I asked for advice on using those enrichment tools, but the answer was: "What you already do is okay; you can also use IPA". We got interesting results. But I was unsure of my methods, so I looked for comparisons between databases, and I couldn't find any.

I realized that comparing those databases is not trivial. They use different schemes and naming conventions, have different aims... so I thought, "That's why I can't find comparisons: it is too hard to do." Although KEGG and Reactome are the most established, I couldn't find a way to compare the pathways and the databases in order to select the most accurate one.

Some time later (in December 2017), when I was at a symposium presenting BioCor, a friend asked me whether there were any established pathway databases and whether the software in BioCor handled approximations and errors. My poor answer was:

No, it takes garbage in, and garbage out. There are too many differences between databases.

But then I realized that the information I use from those databases is not whether there is a physical connection between two proteins, or the type of connection they have (for example, that one protein negatively regulates another). I just use the list of proteins. I would only need that extra information if I used PT methods (and even then, depending on the method, I wouldn't need to know more than the list of genes in a pathway).

So if I only use the list of genes or proteins, the pathway databases are similar to that extent, and I should be able to analyze that aspect of them. I spent some more time digging for comparisons of pathways, but all I could find was how the databases have evolved or improved over previous versions. I couldn't find a method to compare the databases by looking at the list of genes in a pathway. (If I overlooked some studies, please let me know.)
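One simple way to compare databases using nothing but the gene lists is the Jaccard index between pathways. The sketch below is only an illustration of that idea, not the method of any existing package: the function names (jaccard, best_matches), the pathway names and the gene memberships are all toy assumptions, and it assumes both databases already use the same gene identifiers.

```python
def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(a & b) / len(union)


def best_matches(db_a, db_b):
    """For each pathway in db_a, find the most similar pathway in db_b.

    Both arguments map a pathway name to a set of gene identifiers.
    Returns {pathway_in_a: (closest_pathway_in_b, jaccard_index)}.
    """
    result = {}
    for name_a, genes_a in db_a.items():
        best = max(db_b, key=lambda name_b: jaccard(genes_a, db_b[name_b]))
        result[name_a] = (best, jaccard(genes_a, db_b[best]))
    return result


# Toy "databases"; the memberships are invented for the example.
db_a = {"glycolysis": {"HK1", "GPI", "PFKM", "ALDOA"}}
db_b = {
    "Glycolysis / Gluconeogenesis": {"HK1", "GPI", "PFKM", "PCK1"},
    "Cell cycle": {"CDK1", "CCNB1"},
}

for name, (match, score) in best_matches(db_a, db_b).items():
    print(name, "->", match, round(score, 2))
```

This ignores pathway names and topology on purpose: it only asks how much two databases agree on which genes belong together.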

After I started working on a package to analyse those databases, I found (27/02/2018) this work explaining that a gene set can be deemed related to a trait, but that the real question is whether:

it is more related to survival than random sets of genes.

So besides comparing databases, I (we) would need methods to compare pathways against random sets of genes. See this figure:

Modified from Figure 1 of "Most random gene expression signatures are significantly associated with breast cancer outcome".

Are pathways random sets of genes, or are they really non-randomly selected? This is the only study I could find that, instead of focusing on the methods, focuses on the data used by those methods. It presents the idea that we should test against random sets of genes. So we should also be able to generate them. But this made me wonder whether there is some "hidden" information in the databases, and whether the relationships between genes follow some kind of distribution or pattern.
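Testing a pathway against random gene sets can be sketched as follows: draw many random sets of the same size from the same universe, score each of them, and see how often a random set scores at least as well as the real pathway. This is a minimal illustration and not the procedure of the cited study; the function names, the simulated per-gene scores and the choice of the mean as the set score are all assumptions.

```python
import random
from statistics import mean


def random_gene_sets(universe, size, n, seed=0):
    """Draw n random gene sets of a given size from the universe."""
    rng = random.Random(seed)
    pool = sorted(universe)  # sorted so the draw is reproducible
    return [set(rng.sample(pool, size)) for _ in range(n)]


def empirical_p_value(observed, null_scores):
    """Share of random scores at least as large as the observed one.

    The +1 in numerator and denominator keeps the estimate above zero.
    """
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Toy data: simulated per-gene scores (for example, association with a trait).
rng = random.Random(1)
scores = {f"g{i}": rng.gauss(0, 1) for i in range(200)}
pathway = {f"g{i}" for i in range(10)}


def set_score(genes):
    """Hypothetical gene set score: the mean of the per-gene scores."""
    return mean(scores[g] for g in genes)


null = [set_score(s) for s in random_gene_sets(scores, len(pathway), 1000)]
print(empirical_p_value(set_score(pathway), null))
```

Sampling genes uniformly is the simplest possible null. If the databases do hide some structure in how genes are grouped, a more realistic generator of random sets would have to reproduce that structure.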

The outcome

So I am now creating a package with methods to compare databases, create random gene sets, and learn more from those lists of genes about the relationships between them. It is still in an early phase, but comments and feedback are welcome.

Bioinformatics or B101nformatics