RV coefficient

I have recently learned about the RV coefficient and I wanted to share it:

The RV coefficient is like a correlation coefficient, but between two sets of variables (two matrices measured on the same samples). So far I have found several R functions that seem to calculate it:

FactoMineR::coeffRV
subselect::rv.coef
MatrixCorrelation::RV

However, the rv.coef function from the subselect package works with only one matrix, and the coeffRV and RV functions differ in their results.

This led me to search for its definition. All the papers mention Escoufier as the originator of this idea. The longest (and most relevant) citation can be found in a preprint, which mentions the year, the author, and that several of these papers are in French.

In that document we can find the definition, which only the RV function follows as written; the coeffRV function first centers the columns of each matrix.
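For reference, this is the form in which the definition is usually written. The notation is the commonly used one and is my assumption, not copied from the preprint: X and Y are two matrices with the same n rows (samples), and tr is the trace. The formula is commonly stated for column-centered matrices, which is exactly the point where implementations can differ.

```latex
\mathrm{RV}(X, Y) =
  \frac{\operatorname{tr}\left(XX^{\top}\, YY^{\top}\right)}
       {\sqrt{\operatorname{tr}\left((XX^{\top})^{2}\right)\,
              \operatorname{tr}\left((YY^{\top})^{2}\right)}}
```

The value lies between 0 and 1, and it equals 1 when X and Y are the same matrix.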

You can check that:

all.equal(RV(scale(X, scale = FALSE), scale(Y, scale = FALSE)), coeffRV(X, Y)$rv)
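If you want to see what is being computed without installing any of the packages, here is a minimal sketch in base R. The function name rv_manual and its center argument are hypothetical (they do not come from FactoMineR, subselect or MatrixCorrelation), and the data are random numbers used only for illustration.

```r
# Hypothetical helper, not part of any package:
# RV coefficient computed directly from its usual definition.
rv_manual <- function(X, Y, center = TRUE) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  stopifnot(nrow(X) == nrow(Y))  # both matrices must describe the same samples
  if (center) {
    # Center the columns only (no scaling to unit variance)
    X <- scale(X, center = TRUE, scale = FALSE)
    Y <- scale(Y, center = TRUE, scale = FALSE)
  }
  Wx <- tcrossprod(X)  # X %*% t(X), an n x n matrix
  Wy <- tcrossprod(Y)  # Y %*% t(Y), an n x n matrix
  # For symmetric matrices, trace(A %*% B) equals sum(A * B)
  sum(Wx * Wy) / sqrt(sum(Wx * Wx) * sum(Wy * Wy))
}

set.seed(1)
X <- matrix(rnorm(50), nrow = 10)  # 10 samples, 5 variables
Y <- matrix(rnorm(30), nrow = 10)  # 10 samples, 3 variables

rv_manual(X, Y)                  # with centered columns
rv_manual(X, Y, center = FALSE)  # on the raw values
rv_manual(X, X)                  # a matrix with itself: 1, up to floating-point error
```

Comparing the centered and uncentered calls on your own data is a quick way to see how much the centering step alone changes the result.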
Recent posts

I don't do machine learning

Yes, the title is true: even though I do data science in bioinformatics, I don't do machine learning.

As seen recently, regressions tend to work as well as machine learning if used correctly. Classic tools (?) still work; I can't say I have tried all of them, but they are quite useful.

Also, in bioinformatics it is hard to get enough samples both to make a good, reliable generalization and to train a model reliably with enough confidence.

Last, most machine learning methods are black boxes to me; I don't understand them (yet). I like to understand what I use (although I can't say I have deeply understood the differences between some of the regression methods I use).

Then why am I writing this?

Because it seems to be hype to say things like "powerful network medicine tools" or "machine learning model" without explaining them in detail. So it becomes a black box, and science is not about black boxes.
In science we want to increase the knowledge and…

Functional enrichment methods and pathways

For some time I have been working on one topic. I am not sure if this is how it started, but I fail to see other reasons. So I'll describe why I'm now working with gene set collections.

The trigger

I usually try to help others on Biostars, Bioconductor, and the StackExchange network (especially in Bioinformatics). On one of these sites I was trying to help someone, and one of the comments (Jun 21 '17) says:

You don't build pathway maps from bioinformatics data, you build them from wet-lab experiments.

And my reaction was: "Why not? We already know (kind of) the number of genes, and we have an idea of the number of metabolites in a cell. We have lots of data, why can't we build pathways?" But I did a brief literature search and I couldn't find anything (if there is something, let me know in the comments).

The background

Let me explain why this comment got me puzzled: in my work I am usually asked what is the relevance of the lists of gen…

RISK cohort

For some time I have been working on Crohn's disease. One of the problems with the disease is that it is not known what exactly happens. People have found associations with microorganisms, but the relationship between those microorganisms and the patient is still unknown. Also, the risk factors for complications are largely unknown.

This post follows the use of data from a patient cohort enrolled to identify the risk factors for complications and health-care costs in pediatric- and adult-onset Crohn’s disease. We will see some uses of the data and the problems caused by unclear descriptions when the same data are reused.

Articles describing the RISK cohort

The first mention of the RISK cohort I could find is in this article [1], where they describe the cohort as:
an observational research program that enrolled patients younger than age 17 diagnosed with inflammatory (nonpenetrating, nonstricturing) CD from 2008 through 2012 at 28 pediatric gastroenterology centers in North America. I had 552 patients …

Bioconductor histories with git-svn

If you are developing software, you are probably using version control (if not, do it :). Until 05/2017 Bioconductor is using svn. However, it is migrating to git; meanwhile a hybrid system is provided, where one submits the project through GitHub using the git version control system while Bioconductor uses svn internally. Here are some experiences from developing for Bioconductor in this configuration.

After following the configuration recommendations, .git/config ends up with:

[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
[remote "origin"]
    url = https://github.com/llrs/BioCor.git
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master
[remote "bioc"]
    url = https://github.com/Bioconductor-mirror/BioCor.git
    fetch = +refs/heads/*:refs/remotes/bioc/*
[svn-remote "devel"]
    url = https://hedgehog.fhcrc.org/bioconductor//trunk/madman/…

GSEA in Bioconductor

Gene Set Enrichment Analysis is a test designed to find out whether the position of a group of genes along a ranked list implies some difference. The best-known method is the one maintained by the Broad Institute, as it was the first to be widely used in biology and it hosts several collections of gene sets. A gene set is a collection of genes related by either a function or an experiment; it is as fuzzily defined as a pathway.

In Bioconductor there is the underused BiocViews tool, a set of topics for package classification. We can find a category for GSEA under Software>BiologicalQuestion>GeneSetEnrichment.

This category lists 74 packages at the time of writing, which provide functions for Gene Set Enrichment Analysis. It would be too long (and too hard for me) to describe all the packages in that category. However, it doesn't include all the packages that perform gene set enrichment.

The first package for GSEA in Bioconductor one should look at is GSEABase, which provides tools for reading files from the Br…

BioCor: My first package in Bioconductor

Yesterday I received an amazing email:

Congratulations, BioCor has been added to Bioconductor!
Yes, I had submitted a package for the Bioconductor project at the beginning of the week.

The package calculates similarities between pathways, genes and clusters of genes based on their pathways. A pathway is a group of functionally related proteins, thus these similarities measure the functional similarity of the pathways or genes in question.


If anyone is curious about what the email said, this was the body (I didn't know what to expect when I learned that it would be accepted):


Hi Lluís,

Congratulations, BioCor has been added to Bioconductor!
Currently, the definitive location for your Bioconductor package is
in our SVN repository. The following information is to help you in
your role as a package maintainer. You’ll need the following
credentials to maintain your package:

Subversion user ID: myuser
Password: mypassword

Package ‘landing pages’

Every package in Bioconductor gets its own land…