Thursday, 3 December 2015

Hexamer likelihood

Do you ever heard about hexamer likelihood used to determine if a sequence is coding or not?

In one of my assigments on my master of Bioinformatics there was a project to use hexamers instead. The hexamers are defined with 6 nucleotides on a sequence. The idea behind using hexamers is considering that in coding regions there is an influence of the previous codon to the next codon, whereas on the intronic or non-coding regions there isn't.

The script works need three fasta files, the first one with intronic sequences, the second one with coding sequences, and the last one with sequences to analyze. All files must use the IUPAC code for sequences, and on the right frame. This is due to the way of signaling in which frame a sequence should be, which varies between programs and descriptions. It is also more difficult to calculate the best score for the unknown sequence, without presenting two scores or biasing for a training set. Assuming everything is on frame is safer and easier and more intuitive.

We can run the program with:
$ ./ intronic.fasta coding.fasta unknown.fasta 2>hexamer_likelihood.log
The script print in the STDERR channel information about the execution. It can be stored in the hexamer_likelihood.log file using 2>hexamer_likelihood.log while the output is printed on the terminal:

Coding unknown_012 395bp
Intronic unknown_016 265bp
Intronic unknown_001 56bp
Intronic unknown_006 65bp
Coding unknown_019 81bp
Coding unknown_020 735bp
Intronic unknown_002 158bp
Coding unknown_018 435bp
Intronic unknown_011 1341bp
Coding unknown_014 164bp
Intronic unknown_009 62bp
Coding unknown_015 1141bp
Coding unknown_017 91bp
Intronic unknown_003 2390bp
Intronic unknown_004 1552bp
Coding unknown_007 427bp
Intronic unknown_010 59bp
Intronic unknown_008 1205bp
Coding unknown_013 205bp

Note that the result printed is Intronic or Coding and the identifier and description of the sequence, the sequence order may change as it is not sorted.
To retrieve this output the program reads first the fasta file to analyse, then the intronic fasta file is read and the frequency of the hexamers for the whole file is calculated, followed by the same process for the coding.fasta file, then for each sequence of the unknown file and for each hexamer the logarithm of the frequencies between intronic and coding for that hexamer is calculated and added up. Then it is divided by the number of hexamers on the sequences to obtain a score of the sequence. With this score we classify each sequence into Coding or Intronic.
The script has documentation about the functions and how to use them using pod, to access these information you can use $ perldoc
The created script works with the files provided (2000, 2000 and 20 sequences respectively) in 1 wallclock secs ( 1.20 usr + 0.19 sys = 1.39 CPU). It is fast due to the use of references whenever there is a variable with large information, I.e.: between the hash of frequencies for each hexamer in both training files, and with the hashes with the fasta files read.
The complexity of the algorithm depends on the max length of the sequences and the number of sequences of the three files. If we consider the maximum number of hexamers in the longest sequence, the number of sequences as n, and the number indicate each file 1 intronic, 2 coding, 3 data to analyze then the complexity of the algorithm is: O(n1*m1²+n2*m2² +n3*m3) So in the worst case the algorithm scales following a quadratic function.
To test it longer some files with the first x sequences of the training set files were created provided on the assignment keeping constant the other one. The range were the first 100, 500, 1000, and 1500 sequences:

On the previous image we can see how the mean time needed by the program when the number of sequences of the files change. The files are too short to differentiate clearly the complexity of the algorithm but we can observe that the longer a files are, the longer time it needs.
I created a new file half.fasta which is a mix of sequences from the coding.fasta and from intronic.fasta, with the first 1000 sequences of each file.
Testing if it is accurate I downloaded a file from NCBI from the RefSeq NM_206781, as the training set is from the Drosophila melanogaster, and created three files with it, one with the complete sequence of the gene, one with an intronic sequence, and another one with the a coding sequence.

half intronic intronic half half coding coding half intronic coding coding intronic
complete_NM_206781 Intronic Coding Coding Intronic Coding Intronic

We can see on this table that for the training set it is not biased. If we swap the files from their expected positions the result change accordingly. And that with the half file we can reach the same conclusion. On the other hand on the complete sequence it detects that there are coding regions, but it could go undetected if the coding region is shorter than the non-coding region.

I run the program analyzing the own training datasets. The coding.fasta file contains 237 intronic sequences and 1755, (the remaining 8 sequences are too short to have hexamers). The intronic.fasta file contains 55 coding sequences according to the algorithm and 1945 intronic.  

No comments:

Post a Comment

Thanks for your comment!
Gracias por tu comentario.
Gràcies pel teu comentari.