How to find domains on a protein sequence?

Probably you have heard about blast. It helps to find homologous sequences, it can be use through different variants with proteins, genes or just partial sequences to try to find what it is.

It can be used also for proteins but there is another tool which can compete with it: HMMER
In this article I will describe how I used it to find which kind of protein do we have.

HMMER, like blast, has a family of functionalities but it was originally thought as a tool to find homologous proteins through a Hidden Markov Model. At the date of publication has four main functionalities (hmmsearch jackhmmer hmmscan phmmer) that can be used online.

I will center in hmmsearch, the main tool, which we can run locally after downloading the software, I assume in this post that you are working with a computer with some of the Linux distributions and that you already downloaded the HMMER software and installed it. (See the manual)

Setps

We need a profile of Hidden Markov Model (pHMM [For an extended review read this article]), which we can make our own profile or we can download a profile of PFAM proteindata base here. This let us use the most up-to-date information without limiting us to a specifically domain.
Once we download it we need to uncompress (in a linux computer with $gunzip Pfam-A.hmm.gz ) and then use one of the auxiliary functionalities of hmmer to prepare it for the hmmscan. We must do $hmmpress Pfam-A.hmm

It will create a set of different files ready for the use of hmmscan.
The next step is running the main program, we can do it with $hmmscan -o .

Where the can be a single protein fasta sequence or a multifasta protein sequence. the is the profile of Hidden Markov Model in this case the Pfam-A.hmm file. And the is the name of the file where the results are printed. You can read the manual for a longer explanation.

Result

Once we get the output we can parse it with several tools like the new module of biopython wich let us parse the output file. I did my own parser that include other functionalities, as I will try to show with this file sample, which is the first entry of a query I did.This file is plain text as it was created without any formatting option.

As you can see this provide many information, and the file can be quite large. Let's focus just on the domains:

   E-value score bias    E-value score bias    exp N Model      Description
   -------    ------ -----   -------     ------ -----   ---- -- --------          -----------
   2e-95     318.2   0.6    3.3e-95 317.5   0.1    1.5 2 Bac_DnaA      Bacterial dnaA protein
   3.8e-30 103.2   0.2    9.4e-30 101.9   0.1    1.7 1 Bac_DnaA_C domain
   5.7e-12    45.0   0.0    1.1e-11    44.1   0.0    1.4 1 IstB                IstB-like ATP binding protein
   3.1e-05    23.9   0.0    8.2e-05 22.5   0.0    1.7 2 AAA               ATPase family associated with various cellular
4.7e-05   22.8   0.1    0.00015   21.2   0.0    1.9 2 Arch_ATPase   Archaeal ATPase
   0.00053   19.8   0.1    0.0017 18.1 0.1    1.8 1 RNA_helicase   RNA helicase
     0.0017   16.9   3.7     0.085    11.3 0.3    2.2 2 AFG1_ATPase AFG1-like ATPase
------ inclusion threshold ------
     0.011   15.0   0.3           3.8       7.0   0.0    2.4 2 CheC               CheC-like family
     0.038   12.3   0.2        0.058     11.7 0.2    1.3 1 Kinesin         Kinesin motor domain

This provide the main information of which domains are present. But for each domain it also provide the alignment, and it can be that there are more than one alignment for each domain. For example the Bac_DnaA it has 2 alignments, the summary of them are in the following lines:

>> Bac_DnaA Bacterial dnaA protein
#    score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
---   ------ ----- ---------    ---------   ------- -------    ------- -------    ------- -------    ----
1 ! 317.5   0.1   2.5e-98   3.3e-95       1     218 [.     116     333 ..     116     334 .. 0.99
2 ? -2.6   0.0           2   2.6e+03      92     200 ..     380     404 ..     361     437 .. 0.55

Which have where the alignment of each begins and the e-value for each alignment.

To consider. The domains can overlap partially or completely which can mean that the longest domain already includes the shortest one.

Script

One last thing, I want to present my script in python.

It selects the best alignment by its e-value and then the longest non-overlapping domains for each query and creates a .csv file with the name of the query, the model, de description, the significance (e-value), the alignments positions and the length of the domain. It can use also the alignments under the threshold or not through the options of the script, to see them use $python hmmer_parser_2.py -h
You can found the script here on my github account

If you have any question about HMMER or the script comment below

Other information:
A more detailed comparison between blast and hmmer with examples can be found here.
Don't forget to read the manual or the web application if you don't use the HMMER scan for a genome-scale analysis and you don't want to install it in your server/computer

Bioinformatics or B101nformatics

Search This Blog