Saturday, 10 March 2018

Functional enrichment methods and pathways

For some time I have been working on one topic. I am not sure if this is how it started, but I fail to see other reasons. So I'll describe why I'm now working with collections of gene sets.

The trigger

I usually try to help others on Biostars, Bioconductor, and the StackExchange network (especially on Bioinformatics). On one of these sites I was trying to help someone, and one of the comments (Jun 21 '17) said:
You don't build pathway maps from bioinformatics data, you build them from wet-lab experiments.
And I thought: "Why not? We already know (kind of) the number of genes, and we have an idea of the number of metabolites in a cell. We have lots of data, why can't we build pathways?" But I did a brief literature search and I couldn't find anything (if there is something, let me know in the comments).


The background

Let me explain why this comment puzzled me: in my work I am usually asked what the relevance of a list of genes is. What are these genes doing? My first reaction is to use one of those wonderful functional enrichment tools. Most (all?) of these tools use one of three methods:
  • over-representation analysis (ORA),
  • functional class scoring (FCS), and
  • pathway topology (PT)
Wonderfully explained and compared here, and summarized in this figure:

Functional pathways analysis methods. From: Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges, by Purvesh Khatri, Marina Sirota, and Atul J. Butte.

I lean towards using ORA methods with gene ontologies, with topGO (see this other post), and with pathways from Reactome, with clusterProfiler or ReactomePA. I also use fgsea with the pathways of Reactome or with the gene sets from the Broad Institute (NOT with gene ontologies, because of the relationships between their terms).
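To make the ORA idea concrete, here is a minimal sketch in Python of the one-sided hypergeometric test that over-representation tools are commonly built on. It is not the implementation of topGO, clusterProfiler, or ReactomePA; the function name ora_p_value and the toy gene identifiers are made up for illustration, and real tools add multiple-testing correction on top of this.

```python
from math import comb


def ora_p_value(gene_list, gene_set, universe):
    """One-sided hypergeometric p-value for over-representation.

    gene_list: genes of interest (e.g. differentially expressed genes)
    gene_set:  genes annotated to one category or pathway
    universe:  all genes that could have been selected (the background)
    """
    universe = set(universe)
    gene_list = set(gene_list) & universe
    gene_set = set(gene_set) & universe
    n_universe = len(universe)
    n_set = len(gene_set)
    n_list = len(gene_list)
    overlap = len(gene_list & gene_set)
    # P(X >= overlap) when drawing n_list genes at random from the universe
    favourable = sum(
        comb(n_set, i) * comb(n_universe - n_set, n_list - i)
        for i in range(overlap, min(n_set, n_list) + 1)
    )
    return favourable / comb(n_universe, n_list)


# Toy example with made-up gene identifiers.
background = [f"GENE_{i}" for i in range(10)]
category = background[:5]           # 5 genes annotated to the category
interesting = background[:3]        # 3 genes of interest, all in the category
print(ora_p_value(interesting, category, background))
```

The result depends heavily on the universe passed in, which is one more place where the underlying gene lists matter as much as the method.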

As you can see, all these methods tend to use a predefined list of genes associated with a category. But some time before that comment, I found that the data I (and, I would say, many other packages and users) relied upon was incorrect.


The adventure

Then I started to think about how these pathway databases work. There are many, with some differences between them. "Which is the most reliable? Which one should I use? Which is the most up to date?"

I asked for advice about using those enrichment tools, but the answer was: "It is okay what you already do, you can also use IPA". We got interesting results. But I was unsure of my methods, so I looked for comparisons between databases, and I couldn't find any.

I realized that comparing those databases is not trivial. They use different schemes and naming conventions, have different aims... so I thought, "That's why I can't find comparisons, it is too hard to do." Although KEGG and Reactome are the most established, I couldn't find a way to compare the pathways and the databases to select the most accurate one.

Some time later (in December 2017), when I was at a symposium presenting BioCor, a friend asked me if there were some established pathway databases and if the software in BioCor handled approximations and errors. My poor answer was:

No, it takes garbage in, and garbage out. There are too many differences between databases

But then I realized that the information I use from those databases is not whether there is a physical connection between two proteins, or the type of connection they have (for example, that one protein negatively regulates another). I just use the list of proteins. I would only need that information if I used PT methods (and even then, depending on the method, I wouldn't need to know more than the list of genes in a pathway).

So if I only use the list of genes or proteins, the pathway databases are similar to that extent, and I should be able to analyze that aspect of them. I spent some more time digging for comparisons of pathways, but all I could find was how they have evolved or improved from previous versions. I couldn't find a method to compare the databases by looking at the list of genes in a pathway. (If I overlooked some studies, please let me know.)
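One simple way to compare two databases by nothing more than their gene lists is the Jaccard index between every pair of pathways. The sketch below is only an assumption about how such a comparison could start, not the method of any existing package; the names jaccard and compare_databases, and the toy pathways and genes, are hypothetical.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index: shared genes divided by all genes in either set."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_databases(db_a, db_b):
    """Score every pair of pathways (one per database) that shares a gene."""
    scores = {}
    for name_a, genes_a in db_a.items():
        for name_b, genes_b in db_b.items():
            score = jaccard(genes_a, genes_b)
            if score > 0:
                scores[(name_a, name_b)] = score
    return scores


# Hypothetical toy databases: pathway name -> gene identifiers.
db_one = {
    "pathway_1": {"GENE_A", "GENE_B", "GENE_C"},
    "pathway_2": {"GENE_D", "GENE_E"},
}
db_two = {
    "route_x": {"GENE_A", "GENE_B", "GENE_F"},
    "route_y": {"GENE_G"},
}

for pair, score in sorted(compare_databases(db_one, db_two).items()):
    print(pair, round(score, 2))
```

This ignores pathway names and topology on purpose, and it assumes both databases use the same gene identifiers, which in practice would need a mapping step first.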

After I started working on a package to analyse those databases, I found (27/02/2018) this work explaining that a gene set can be deemed related to a trait, but that the question is whether:
it is more related to survival than random sets of genes.
So besides comparing databases, I (we) would need methods to compare pathways against random sets of genes. See this figure:
Modified from Figure 1 of "Most random gene expression signatures are significantly associated with breast cancer outcome"
Are pathways random sets of genes, or are they really non-randomly selected? This is the only study I could find that, instead of focusing on the methods, focuses on the data used by those methods. It presents the idea that we should test against random sets of genes. So we should also be able to generate them, but this made me wonder whether there is some "hidden" information in the databases: whether the relationships between genes follow some kind of distribution or pattern.
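Testing a pathway against random gene sets can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not the procedure of the cited study: the set-level statistic (mean_score, a plain mean of per-gene scores), the function names, and all the data are made up. The random sets are drawn with the same size as the pathway so that only the gene membership differs.

```python
import random


def random_gene_sets(universe, size, n_sets, seed=None):
    """Draw n_sets random gene sets of a given size, without replacement."""
    rng = random.Random(seed)
    pool = sorted(universe)  # fixed order so a seed gives reproducible draws
    return [set(rng.sample(pool, size)) for _ in range(n_sets)]


def mean_score(gene_set, gene_scores):
    """Hypothetical set-level statistic: mean of per-gene scores."""
    return sum(gene_scores[g] for g in gene_set) / len(gene_set)


def empirical_p_value(observed, null_scores):
    """Share of random sets scoring at least as high as the observed set.

    The +1 in numerator and denominator avoids reporting a p-value of zero.
    """
    hits = sum(score >= observed for score in null_scores)
    return (hits + 1) / (len(null_scores) + 1)


# Toy data: 100 made-up genes with made-up scores.
score_rng = random.Random(1)
gene_scores = {f"GENE_{i}": score_rng.random() for i in range(100)}
pathway = {"GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5"}

null_sets = random_gene_sets(gene_scores, size=len(pathway), n_sets=1000, seed=42)
null = [mean_score(s, gene_scores) for s in null_sets]
print(empirical_p_value(mean_score(pathway, gene_scores), null))
```

Drawing uniformly from the whole universe is the simplest null model; if the databases do hide some structure in how genes are grouped, a more realistic generator would have to respect it.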


The outcome

So I am now creating a package with methods to compare databases, create random gene sets, and learn more from those lists of genes about the relationships between them. It is still in an early phase, but comments and feedback are welcome.

Saturday, 28 October 2017

RISK cohort

For some time I have been working on Crohn's disease. One of the problems with the disease is that it is not known exactly what happens. People have found associations with microorganisms, but the relationship between those microorganisms and the patient is still unknown. The risk factors for complications are also largely unknown.

This post follows the use of data from a patient cohort enrolled to identify the risk factors for complications and the health-care costs in pediatric and adult-onset Crohn's disease. It shows how the data have been used and the problems caused by unclear descriptions when several articles use the same data.

Articles describing the RISK cohort

The first mention of the RISK cohort I could find is in this article [1], where they describe the cohort as:
an observational research program that enrolled patients younger than age 17 diagnosed with inflammatory (nonpenetrating, nonstricturing) CD from 2008 through 2012 at 28 pediatric gastroenterology centers in North America.
That 2014 article had 552 patients, and it doesn't say where to find the data. It is described as a previous report in this 2017 study [2], which states that 1813 patients were enrolled, of whom 913 were diagnosed with Crohn's disease, had complete information on disease location, had no complications at diagnosis, and attended the follow-up visits. (See this comment for some criticism of the mentioned article, and this other comment about the composition of the cohort [3,4].)

One article that uses these data describes them in its first figure [5]. As you can see in that figure, the RISK cohort, according to this article, has 322 samples. In the body of the article, the origin of the data is given as the following article [6], where the cohort is described as:
The RISK cohort. Ileal biopsy samples and associated clinical information were obtained from the RISK study, an ongoing, prospective observational IBD inception cohort sponsored by the Crohn’s and Colitis Foundation of America. 1,656 children and adolescents younger than 17 years, newly diagnosed with IBD and non-IBD Ctls, were enrolled at 28 North American pediatric gastroenterology centers between 2008 and 2012. All patients were required to undergo baseline colonoscopy and confirmation of characteristic chronic active colitis/ileitis by histology prior to diagnosis and treatment, with the recording of findings in standardized fashion. Only subjects with a confirmed persisting diagnosis of CD, UC, or Ctl during an average of 22 months follow-up to date were included in this analysis, which included a representative subgroup of age-matched CD (n = 243), Ctl (n = 43), and disease Ctl UC (n = 73) patients.
Note that [2] and [6] differ in how they describe the data. It might seem that, of those 1813 patients, 1656 were younger than 17 years. But few seem to have had a persisting diagnosis of CD for about 2 years, because the CD patients are reduced to 243.

In another article [7], referenced in [6], we can read about the RISK cohort that:
A total of 447 children and adolescents (< 17 years) with newly diagnosed CD and a control population composed of 221 subjects with non-inflammatory conditions of the gastrointestinal tract were enrolled to the RISK study in 28 participating pediatric gastroenterology centers in North America between November 2008 and January 2012
This disagrees with the previous article about the total number of patients younger than 17 years with CD (226 instead of 243), maybe because they used a more restrictive subset of the cohort regarding the average follow-up time.

A different study [8] references [2] and [6-7] and describes the cohort as:
The RISK study is an observational prospective cohort study that aims to develop risk models for predicting complicated course in children with Crohn's disease. From 2008 to 2012, the RISK study recruited more than 1,800 treatment-naive patients with a suspected diagnosis of Crohn's disease at 28 pediatric gastroenterology centers in North America.
However, they use 245 samples with ileal CD, of which 35 lacked gut inflammation and were classified as non-IBD controls. The remaining 210 selected individuals showed persisting Crohn's disease and remained in complication-free B1 status for at least 90 d from the time of initial diagnosis. After 3 years of follow-up, 27 had a complicated disease course with progression to states B2 or B3.

I can't understand how those 1656 patients described in [6] end up as 322 patients in [5] instead of the 243 patients with Crohn's disease in [6]. Also, it is not clear how [6] has 243 patients while the origin of the data seems to be [7], where only 226 are described. And from [8] we learn that 245 had ileal CD, which could mean that all the patients described in [6] and [7] are represented by ileal samples. Furthermore, there isn't a reference between [6-7] and [2], which could mean that they are different cohorts of patients, despite being from the same (?) 28 centers in North America and being enrolled in the same period (between November 2008 and January 2012). As this is unlikely, what is missing is a description of how each article processed the same cohort of patients.

These doubts led me to look for the actual data on which all these articles are based, totally or partially.

Availability of the RISK cohort

The first article describing the RISK study [1] doesn't say where to find the data, and neither does the more recent article [2].

Interestingly, Peters et al. ([5]), despite providing links to the other datasets they used, don't provide a link or a reference to where the RISK cohort can be found, perhaps indicating that it is not freely available or that there are other problems with providing the data to the scientific community.

Haberman et al. ([6]) link to a repository in the Gene Expression Omnibus, GSE57945. However, instead of the total of 359 samples selected, that dataset lists 322 samples, which matches the number of samples described in [5] but matches neither the number of samples selected in the original study of the cohort [2] nor their own total number of samples.

Gevers et al. ([7]) only provide references for the 16S projects, not for the RISK RNA-seq expression data used.

Marigorta et al. ([8]) provide a link to another dataset in the Gene Expression Omnibus, GSE93624, which has "210 treatment-naïve patients of pediatric Crohn's disease and 35 non-IBD controls from the RISK study."

Of the original 913 patients with Crohn's disease, at most 532 samples have been made public (322 in GSE57945 plus 210 in GSE93624), and only if the patient IDs in those two datasets do not overlap. GSE57945 was uploaded in 2014 and last updated in 2017, and it provides more information than GSE93624. I couldn't find a way to check whether the same patient has samples in both datasets.
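If patient identifiers could be matched between the two datasets, the check would be a simple set comparison. The Python sketch below only illustrates the arithmetic behind the upper bound; the identifiers are invented, the function summarize_overlap is hypothetical, and only the dataset sizes (322 and 210) come from the text above.

```python
def summarize_overlap(ids_a, ids_b):
    """Count samples unique to each dataset, shared, and in total."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(a & b),
        "total_unique": len(a | b),
    }


# Made-up identifiers; only the dataset sizes (322 and 210) come from the text.
gse57945_ids = {f"A{i}" for i in range(322)}
gse93624_ids = {f"B{i}" for i in range(210)}

# No shared IDs: the upper bound of 322 + 210 unique samples.
print(summarize_overlap(gse57945_ids, gse93624_ids)["total_unique"])
```

Every shared patient lowers the total by one, so the real number of distinct public samples lies somewhere between 322 and 532.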


I don't know where I can find the whole RISK cohort. A fuller description of how the datasets were processed would help clarify what the RISK cohort is. It seems clear that the datasets GSE93624 and GSE57945 are both part of that cohort, but the lack of the whole dataset hinders replicability.

Update 24/01/2018: 
Found a web page of the RISK cohort here, describing the state of the project up to 2013.


  1. Walters, Thomas D., et al. "Increased effectiveness of early therapy with anti-tumor necrosis factor-α vs an immunomodulator in children with Crohn's disease." Gastroenterology 146.2 (2014): 383-391.
  2. Kugathasan, Subra, et al. "Prediction of complicated disease course for children newly diagnosed with Crohn's disease: a multicentre inception cohort study." The Lancet 389.10080 (2017): 1710-1718.
  3. Arijs, Ingrid, and Isabelle Cleynen. "RISK stratification in paediatric Crohn's disease." The Lancet 389.10080 (2017): 1672-1674.
  4. Kugathasan, Subra, Lee A. Denson, and Jeffrey S. Hyams. "Exclusive and partial enteral nutrition for Crohn's disease–Authors' reply." The Lancet 390.10101 (2017): 1486-1487.
  5. Peters, Lauren A., et al. "A functional genomics predictive network model identifies regulators of inflammatory bowel disease." Nature genetics 49.10 (2017): 1437.
  6. Haberman, et al. "Pediatric Crohn disease patients exhibit specific ileal transcriptome and microbiome signature." Journal of Clinical Investigation 124 (2014): 3617-3633.
  7. Gevers, Dirk, et al. "The treatment-naive microbiome in new-onset Crohn’s disease." Cell host & microbe 15.3 (2014): 382-392.
  8. Marigorta, Urko M., et al. "Transcriptional risk scores link GWAS to eQTLs and predict complications in Crohn's disease." Nature genetics 49.10 (2017): 1517.


Friday, 19 May 2017

Bioconductor histories with git-svn

If you are developing software, you might be using version control (if not, do it :). Until 05/2017 Bioconductor has been using svn. However, it is migrating to git; meanwhile, a hybrid system is provided, where one submits the project through GitHub using the git version control system while svn is used internally. Here are some experiences of developing for Bioconductor in this configuration.

After following the recommended configuration, .git/config ends up with:

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
[remote "origin"]
    url =
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[remote "bioc"]
    url =
    fetch = +refs/heads/*:refs/remotes/bioc/*
[svn-remote "devel"]
    url =
    fetch = :refs/remotes/git-svn-devel
[svn-remote "release-3.5"]
    url =
    fetch = :refs/remotes/git-svn-release-3.5
[branch "release-3.5"]
    remote = bioc
    merge = refs/heads/release-3.5
[branch "devel"]
    remote = bioc
    merge = refs/heads/master

This configuration creates a devel branch forked from Bioconductor's GitHub mirror, which is equivalent to the devel trunk in svn.
However, I develop my package in the master branch, which creates the hassle of bringing (merging or cherry-picking) the changes (commits) from master to the devel branch for a later release on Bioconductor, or to a release-* branch for a hot fix.

Also make sure to check that you have permission to write to the repository in Bioconductor.