Do you ever heard about hexamer likelihood used to determine if a sequence is coding or not?
In one of my assigments on my master of Bioinformatics there was a project to use hexamers instead. The hexamers are defined with 6 nucleotides on a sequence. The idea behind using hexamers is considering that in coding regions there is an influence of the previous codon to the next codon, whereas on the intronic or non-coding regions there isn't.
The script works
need three fasta files, the first one with intronic sequences, the
second one with coding sequences, and the last one with sequences to
analyze. All files must use the IUPAC code for sequences, and on the
right frame. This is due to the way of signaling in which frame a
sequence should be, which varies between programs and descriptions.
It is also more difficult to calculate the best score for the unknown
sequence, without presenting two scores or biasing for a training
set. Assuming everything is on frame is safer and easier and more
intuitive.
We can run the program with:
$ ./hexamer_likelihood.pl
intronic.fasta coding.fasta unknown.fasta 2>hexamer_likelihood.log
The script print in
the STDERR channel information about the execution. It can be stored
in the hexamer_likelihood.log file using 2>hexamer_likelihood.log while the output is
printed on the terminal:
Coding unknown_012
395bp
Intronic unknown_016
265bp
Intronic unknown_001
56bp
Intronic unknown_006
65bp
Coding unknown_019
81bp
Coding unknown_020
735bp
Intronic unknown_002
158bp
Coding unknown_018
435bp
Intronic unknown_011
1341bp
Coding unknown_014
164bp
Intronic unknown_009
62bp
Coding unknown_015
1141bp
Coding unknown_017
91bp
Intronic unknown_003
2390bp
Intronic unknown_004
1552bp
Coding unknown_007
427bp
Intronic unknown_010
59bp
Intronic unknown_008
1205bp
Coding unknown_013
205bp
Note that the result
printed is Intronic or Coding and the identifier and description of
the sequence, the sequence order may change as it is not sorted.
To retrieve this
output the program reads first the fasta file to analyse, then the
intronic fasta file is read and the frequency of the hexamers for the
whole file is calculated, followed by the same process for the
coding.fasta file, then for each sequence of the unknown file and for
each hexamer the logarithm of the frequencies between intronic and
coding for that hexamer is calculated and added up. Then it is
divided by the number of hexamers on the sequences to obtain a score
of the sequence. With this score we classify each sequence into
Coding or Intronic.
The script has
documentation about the functions and how to use them using pod, to
access these information you can use $ perldoc deliver_final.pl
The created script
works with the files provided (2000, 2000 and 20 sequences
respectively) in 1 wallclock secs ( 1.20 usr + 0.19 sys = 1.39
CPU). It is fast due to the use of references whenever there is a
variable with large information, I.e.: between the hash of
frequencies for each hexamer in both training files, and with the
hashes with the fasta files read.
The complexity of
the algorithm depends on the max length of the sequences and the
number of sequences of the three files. If we consider the maximum
number of hexamers in the longest sequence, the number of sequences
as n, and the number indicate each file 1 intronic, 2 coding, 3 data
to analyze then the complexity of the algorithm is: O(n1*m1²+n2*m2²
+n3*m3) So in the worst case the algorithm
scales following a quadratic function.
To test it longer
some files with the first x sequences of the training set files were
created provided on the assignment keeping constant the other one.
The range were the first 100, 500, 1000, and 1500 sequences:
On the previous
image we can see how the mean time needed by the program when the
number of sequences of the files change. The files are too short to
differentiate clearly the complexity of the algorithm but we can
observe that the longer a files are, the longer time it needs.
I created a new file
half.fasta which is a mix of sequences from the coding.fasta and from
intronic.fasta, with the first 1000 sequences of each file.
Testing if it is
accurate I downloaded a file from NCBI from the RefSeq NM_206781, as the training set is from the Drosophila melanogaster, and
created three files with it, one with the complete sequence of the
gene, one with an intronic sequence, and another one with the a
coding sequence.
half intronic | intronic half | half coding | coding half | intronic coding | coding intronic | |
intronic_NM_206781 |
Intronic
|
Coding
|
Intronic
|
Coding
|
Intronic
|
Coding
|
coding_NM_206781 |
Intronic
|
Coding
|
Coding
|
Intronic
|
Coding
|
Intronic
|
complete_NM_206781 | Intronic | Coding | Coding | Intronic | Coding | Intronic |
We can see on this
table that for the training set it is not biased. If we swap the
files from their expected positions the result change accordingly.
And that with the half file we can reach the same conclusion. On the
other hand on the complete sequence it detects that there are coding
regions, but it could go undetected if the coding region is shorter
than the non-coding region.
I run the program
analyzing the own training datasets. The coding.fasta file contains
237 intronic sequences and 1755, (the remaining 8 sequences are too
short to have hexamers). The intronic.fasta file contains 55 coding
sequences according to the algorithm and 1945 intronic.