Saturday, 28 October 2017

RISK cohort

For some time I have been working on Crohn's disease. One of the problems with the disease is that its mechanism is not well understood. People have found associations with microorganisms, but the relationship between those microorganisms and the patient is still unknown. The risk factors for complications are also largely unknown.

This post follows the use of data from a patient cohort enrolled to identify the risk factors for complications and health-care costs in pediatric and adult-onset Crohn's disease. It shows how the data have been used and the problems that arise when different articles describe the same data unclearly.

Articles describing the RISK cohort

The first mention of the RISK cohort I could find is in this article [1], which describes the cohort as:
an observational research program that enrolled patients younger than age 17 diagnosed with inflammatory (nonpenetrating, nonstricturing) CD from 2008 through 2012 at 28 pediatric gastroenterology centers in North America.
It had 552 patients in that 2014 article, which doesn't say where to find the data. A 2017 study [2] cites it as a previous report. That study states that 1813 patients were enrolled, of whom 913 were diagnosed with Crohn's disease, had complete information on disease location, had no complications at diagnosis, and attended the follow-up visits. (See this comment for some criticism of that article, and this other comment about the composition of the cohort [3,4].)

One article that uses these data describes them in its first figure [5]:
Figure 1: An integrative approach for constructing a predictive network model of IBD, and identifying and validating master regulators of these networks.

As you can see in that figure, the RISK cohort has 322 samples according to this article. In the body of the article, the origin of the data is given as the following article [6], where the cohort is described as:
The RISK cohort. Ileal biopsy samples and associated clinical information were obtained from the RISK study, an ongoing, prospective observational IBD inception cohort sponsored by the Crohn’s and Colitis Foundation of America. 1,656 children and adolescents younger than 17 years, newly diagnosed with IBD and non-IBD Ctls, were enrolled at 28 North American pediatric gastroenterology centers between 2008 and 2012. All patients were required to undergo baseline colonoscopy and confirmation of characteristic chronic active colitis/ileitis by histology prior to diagnosis and treatment, with the recording of findings in standardized fashion. Only subjects with a confirmed persisting diagnosis of CD, UC, or Ctl during an average of 22 months follow-up to date were included in this analysis, which included a representative subgroup of age-matched CD (n = 243), Ctl (n = 43), and disease Ctl UC (n = 73) patients.
Note that [2] and [6] describe the data somewhat differently. It might seem that, of those 1813 patients, 1656 were younger than 17 years. But few seem to have had a persisting diagnosis of CD for about 2 years, because the CD patients are reduced to 243.

In another article [7] referenced in [6], we can read about the RISK cohort that:
A total of 447 children and adolescents (< 17 years) with newly diagnosed CD and a control population composed of 221 subjects with non-inflammatory conditions of the gastrointestinal tract were enrolled to the RISK study in 28 participating pediatric gastroenterology centers in North America between November 2008 and January 2012
This disagrees with the previous article about the total number of patients younger than 17 years with CD (226 instead of 243), maybe because they used a more restrictive subset of the cohort with respect to the average follow-up time.

A different study [8] references [2] and [6-7] and describes the cohort as:
The RISK study is an observational prospective cohort study that aims to develop risk models for predicting complicated course in children with Crohn's disease. From 2008 to 2012, the RISK study recruited more than 1,800 treatment-naive patients with a suspected diagnosis of Crohn's disease at 28 pediatric gastroenterology centers in North America.
However, they use 245 samples from patients with ileal CD, of which 35 lacked gut inflammation and were classified as non-IBD controls. The remaining 210 selected individuals showed persisting Crohn's disease and remained in complication-free B1 status for at least 90 d from the time of initial diagnosis. After 3 years of follow-up, 27 had a complicated disease course with progression to states B2 or B3.

I can't understand how those 1656 patients described in [6] end up as 322 samples in [5] instead of the 243 patients with Crohn's disease in [6]. Also, it is not clear how [6] has 243 patients when the origin of the data seems to be [7], where only 226 are described. From [8] we learn that 245 had ileal CD, which could mean that all the patients described in [6] and [7] were sampled from the ileum. Furthermore, there is no reference linking [6-7] and [2], which could mean that they are different cohorts of patients despite coming from the same (?) 28 centers in North America and being enrolled in the same period (between November 2008 and January 2012). As this is unlikely, what is missing is a description of how each article processed the same cohort of patients.

These doubts shifted my interest to finding the actual data on which all these articles are based, totally or partially.

Availability of the RISK cohort

The first article describing the RISK study [1] doesn't say where to find the data, and neither does the more recent article [2].

Interestingly, Peters et al. [5], despite providing links to the other datasets they used, don't provide a link or a reference for where to find the RISK cohort. This perhaps indicates that it is not freely available, or that there are other problems with providing the data to the scientific community.

Haberman et al. [6] link to a repository in the Gene Expression Omnibus, GSE57945. However, instead of the total of 359 samples selected, that dataset lists 322 samples, which matches the total number of samples described in [5] but matches neither the total number of samples selected in the original study of the cohort [2] nor their own total.
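
The mismatch is easy to check by hand, but writing it down makes the gap explicit. This is a minimal sketch that only uses the numbers quoted in this post (the 243 CD, 43 Ctl and 73 UC patients from [6], and the 322 samples listed in GSE57945); nothing is downloaded from GEO, and the variable names are my own.

```python
# Sample counts as quoted in this post; nothing here is read from GEO.
selected_in_6 = {"CD": 243, "Ctl": 43, "UC": 73}  # subgroup described in [6]
listed_in_gse57945 = 322                           # samples listed in GSE57945

# Total number of samples that [6] says were selected for the analysis.
total_selected = sum(selected_in_6.values())

# Samples described in the article but not listed in the public dataset.
unaccounted = total_selected - listed_in_gse57945

print(f"selected in [6]:    {total_selected}")
print(f"listed in GSE57945: {listed_in_gse57945}")
print(f"unaccounted for:    {unaccounted}")
```

The difference does not say which group the missing samples belong to; that would need the per-sample annotation of the dataset.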

Gevers et al. [7] only provide references to the 16S projects, not to the RISK RNA-seq expression data used.

Marigorta et al. [8] provide a link to another dataset in the Gene Expression Omnibus, GSE93624, which has "210 treatment-naïve patients of pediatric Crohn's disease and 35 non-IBD controls from the RISK study."

Of the original 913 patients with Crohn's disease, at most 532 samples have been made public, and only if the patients in those two datasets are all different. GSE57945 was uploaded in 2014 and last updated in 2017, and provides more information than GSE93624. I couldn't find a way to check whether the same patient has samples in both datasets.
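
The "at most 532" figure is the upper bound of a simple range. The sketch below computes both ends of that range and shows how the overlap could be checked if both datasets exposed a common patient identifier. It assumes one sample per patient in each dataset, which I have not verified, and the identifiers in the example are made up for illustration; they are not real RISK or GEO IDs.

```python
def distinct_patient_bounds(n_a, n_b):
    """Return (minimum, maximum) number of distinct patients in two datasets.

    Assumes each dataset holds exactly one sample per patient.
    Minimum: the smaller dataset is fully contained in the larger one.
    Maximum: the two datasets share no patient at all.
    """
    return max(n_a, n_b), n_a + n_b


def shared_ids(ids_a, ids_b):
    """Return the identifiers present in both datasets."""
    return set(ids_a) & set(ids_b)


# Counts quoted in this post: 322 samples in GSE57945, 210 CD patients in GSE93624.
low, high = distinct_patient_bounds(322, 210)
print(f"between {low} and {high} distinct patients")

# Hypothetical identifiers, only to show the overlap check itself.
example_a = ["P001", "P002", "P003"]
example_b = ["P003", "P004"]
print(sorted(shared_ids(example_a, example_b)))
```

Without a shared identifier in the sample annotation, only the bounds can be stated.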


I don't know where I can find the whole RISK cohort. A fuller description of how each dataset was derived would help clarify what the RISK cohort is. It seems clear that both GSE93624 and GSE57945 come from that cohort, but the lack of the whole dataset hinders replicability.


  1. Walters, Thomas D., et al. "Increased effectiveness of early therapy with anti-tumor necrosis factor-α vs an immunomodulator in children with Crohn's disease." Gastroenterology 146.2 (2014): 383-391.
  2. Kugathasan, Subra, et al. "Prediction of complicated disease course for children newly diagnosed with Crohn's disease: a multicentre inception cohort study." The Lancet 389.10080 (2017): 1710-1718.
  3. Arijs, Ingrid, and Isabelle Cleynen. "RISK stratification in paediatric Crohn's disease." The Lancet 389.10080 (2017): 1672-1674.
  4. Kugathasan, Subra, Lee A. Denson, and Jeffrey S. Hyams. "Exclusive and partial enteral nutrition for Crohn's disease–Authors' reply." The Lancet 390.10101 (2017): 1486-1487.
  5. Peters, Lauren A., et al. "A functional genomics predictive network model identifies regulators of inflammatory bowel disease." Nature genetics 49.10 (2017): 1437.
  6. Haberman, Yael, et al. "Pediatric Crohn disease patients exhibit specific ileal transcriptome and microbiome signature." J. Clin. Invest. 124 (2014): 3617-3633.
  7. Gevers, Dirk, et al. "The treatment-naive microbiome in new-onset Crohn’s disease." Cell host & microbe 15.3 (2014): 382-392.
  8. Marigorta, Urko M., et al. "Transcriptional risk scores link GWAS to eQTLs and predict complications in Crohn's disease." Nature genetics 49.10 (2017): 1517.


Friday, 19 May 2017

Bioconductor histories with git-svn

If you are developing software you are probably using version control (if not, do it :). Until 05/2017 Bioconductor has been using svn. However, it is migrating to git, and in the meantime a hybrid system is provided, where one submits the project through GitHub using the git version control system while Bioconductor internally uses svn. Here are some experiences developing for Bioconductor in this configuration.

After following the recommended configuration, .git/config ends up as:

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
[remote "origin"]
    url =
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[remote "bioc"]
    url =
    fetch = +refs/heads/*:refs/remotes/bioc/*
[svn-remote "devel"]
    url =
    fetch = :refs/remotes/git-svn-devel
[svn-remote "release-3.5"]
    url =
    fetch = :refs/remotes/git-svn-release-3.5
[branch "release-3.5"]
    remote = bioc
    merge = refs/heads/release-3.5
[branch "devel"]
    remote = bioc
    merge = refs/heads/master

This configuration creates a devel branch that tracks the master branch of Bioconductor's GitHub mirror, which is equivalent to the devel trunk in svn.
However, I develop my package in the master branch, which creates the hassle of bringing (merging or cherry-picking) the changes (commits) from master to the devel branch for a later release on Bioconductor, or to a release-* branch for a hot fix.
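
To see at a glance which local branch follows which remote, the branch sections of the configuration can be read programmatically. This is a rough sketch, not a general git config reader: git's format is close to INI but not identical, and Python's configparser just happens to cope with this small file. The function name is my own, and the URLs are left empty as in the listing above.

```python
import configparser

# The remote and branch sections from the .git/config shown in this post.
GIT_CONFIG = """
[remote "origin"]
    url =
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[remote "bioc"]
    url =
    fetch = +refs/heads/*:refs/remotes/bioc/*
[branch "release-3.5"]
    remote = bioc
    merge = refs/heads/release-3.5
[branch "devel"]
    remote = bioc
    merge = refs/heads/master
"""


def branch_tracking(config_text):
    """Map each local branch to the (remote, remote branch) it tracks."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so no key can be mistaken for a continuation line.
    parser.read_string("\n".join(line.strip() for line in config_text.splitlines()))
    tracking = {}
    for section in parser.sections():
        if section.startswith('branch "'):
            name = section[len('branch "'):-1]
            upstream = parser[section]["merge"].replace("refs/heads/", "", 1)
            tracking[name] = (parser[section]["remote"], upstream)
    return tracking


for branch, (remote, upstream) in branch_tracking(GIT_CONFIG).items():
    print(f"{branch} -> {remote}/{upstream}")
```

The output makes the source of the hassle visible: both master and devel follow a branch called master, but on different remotes.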

Also make sure to check that you have permission to write to the repository in Bioconductor.

Friday, 31 March 2017

GSEA in Bioconductor

Gene Set Enrichment Analysis (GSEA) is a test designed to find out whether the position of a group of genes along a ranked list implies some difference. The best-known method is the one maintained by the Broad Institute, as it was the first to be widely used in biology and it provides several collections of gene sets. A gene set is a collection of genes related by either a function or an experiment; it is as fuzzily defined as a pathway.
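
The idea of "position along a list" can be made concrete with a running sum. The sketch below is a simplified, unweighted version of that idea (every gene in the set counts the same, and no permutation test is done), so it is not the Broad Institute implementation. The function name and the gene names are made up for illustration, and the ranked list is assumed to contain no duplicates.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set along ranked_genes.

    Walk down the ranked list: step up at every gene in the set, step down
    at every other gene. The score is the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1 / n_hit if gene in members else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = [f"g{i}" for i in range(1, 11)]           # g1 is ranked first
print(enrichment_score(ranked, ["g1", "g2", "g3"]))    # set at the top
print(enrichment_score(ranked, ["g8", "g9", "g10"]))   # set at the bottom
print(enrichment_score(ranked, ["g1", "g5", "g9"]))    # set spread along the list
```

A set concentrated at the top gives a score near 1, one at the bottom a score near -1, and a set spread along the list a score closer to 0.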

In Bioconductor there is the underused tool of biocViews, a set of topics for package classification. We can find a category for GSEA under Software>BiologicalQuestion>GeneSetEnrichment.

This category lists 74 packages at the time of writing, which provide functions for Gene Set Enrichment Analysis. It would take too long (and be too hard for me) to describe all the packages in that category. However, it doesn't include every package that performs gene set enrichment.

The first package for GSEA in Bioconductor one should look at is GSEABase, which provides tools for reading files from the Broad Institute and translating the IDs of those gene sets.

There are several types of enrichment analysis (EA, or simply enrichment). They can be classified by the null hypothesis (self-contained or not), by whether they use the phenotype (supervised or unsupervised), and by the unit of the enrichment score (one per sample or one for all the samples).
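
As an illustration of a null hypothesis that is not self-contained (usually called competitive), here is a sketch of the over-representation test: are the genes of a set found among the selected genes more often than expected by drawing at random from all genes? This is one common example of that class, not the method of any particular package mentioned in this post. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under random sampling without replacement.

    n_universe: all genes measured
    n_in_set:   genes of the universe that belong to the gene set
    n_selected: genes selected (for example, differentially expressed)
    n_overlap:  selected genes that belong to the gene set
    """
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    favourable = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 20 genes measured, 5 in the set, 5 selected, 4 shared.
p_value = hypergeom_upper_tail(20, 5, 5, 4)
print(f"p = {p_value:.5f}")
```

A self-contained test would instead ask whether the genes in the set change at all, without comparing them to the genes outside it.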

We could further classify them by whether they take into account the relationships between genes, and by whether they take into account the relationships between gene sets.

I would like to highlight some packages from Bioconductor performing GSEA: limma, GSAR, GSVA, piano, fgsea, and topGO.

From limma I would like to highlight that some of the functions it provides for GSEA account for the correlation of expression between the genes in the gene set. The GSEA functions are mroast, roast, fry, camera and romer. barcodeplot is the function for plotting the enrichment in that package.

The GSAR package is interesting because most of its GSEA methods are graph/network based. Interesting functions are WWtest, KStest, MDtest, RKStest, RMDtest, AggrFtest and GSNCAtest. The function plotMST2.pathway, which visualizes the network of a gene set, is also interesting.

From the GSVA package, the interesting function is gsva, which allows the use of several methods, such as zScore, PLAGE and its own method, gsva.
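
To show what a per-sample enrichment score looks like, here is a sketch of the z-score idea as I understand it: standardize each gene across the samples, then for each sample add up the z-scores of the genes in the set and divide by the square root of the number of genes. It is written in plain Python for illustration and is not the GSVA code; the function name and the toy expression values are my own, and it assumes no gene has zero variance.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_set_scores(expression, gene_set):
    """Return one enrichment score per sample for a gene set.

    expression: dict mapping gene name -> list of values, one per sample.
    Genes of the set that are missing from expression are ignored.
    """
    genes = [g for g in gene_set if g in expression]
    if not genes:
        raise ValueError("no gene of the set is present in the expression data")
    n_samples = len(expression[genes[0]])

    # Standardize each gene across samples (assumes a non-zero standard deviation).
    z = {}
    for gene in genes:
        values = expression[gene]
        centre, spread = mean(values), stdev(values)
        z[gene] = [(v - centre) / spread for v in values]

    # Combine the z-scores of the set within each sample.
    return [sum(z[g][i] for g in genes) / sqrt(len(genes)) for i in range(n_samples)]


# Toy data: three genes measured in three samples.
toy = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [5, 5, 6]}
print(zscore_set_scores(toy, ["a", "b"]))
```

Each sample gets its own score, so the result can be used like any other per-sample variable, without needing the phenotype.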

The piano package implements in R the same algorithm as the one from the Broad Institute, along with several other methods, in the function runGSA.

From the fgsea package I highlight the speed of the fgsea function and the plotEnrichment function to plot the result.

From topGO I highlight that it is the one that takes the most advantage of the structure of the gene ontologies, but it has several bugs (I am trying to improve it here).

Other interesting packages are gage, anamiR, PGSEA, EGSEA, GSEAlm, goseq, sigPathway, ReactomePA, meshes and EWCE.