Posts

Showing posts from December, 2015

Hexamer likelihood

Have you ever heard of the hexamer likelihood, used to determine whether a sequence is coding or not? One of the assignments in my master's in Bioinformatics was a project that used hexamers for this purpose. A hexamer is a run of six consecutive nucleotides in a sequence. The idea behind using hexamers is that in coding regions each codon is influenced by the codon before it, whereas in intronic or other non-coding regions there is no such influence.

The script needs three FASTA files: the first with intronic sequences, the second with coding sequences, and the last with the sequences to analyze. All files must use the IUPAC code for sequences, and every sequence must be in the right frame. The frame requirement exists because the way of indicating which frame a sequence is in varies between programs and descriptions. Without a known frame it is also harder to calculate the best score for the unknown sequence without presenting two scores or biasing towards a training set. Assuming everything is in frame is safer and easi
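To make the idea concrete, here is a minimal sketch of a hexamer log-likelihood score in Python. It is not the script from the assignment. The function names, the codon-sized step of 3, and the pseudocount of 1 are my own assumptions for illustration. Sequences are passed in as plain strings, so reading the three FASTA files is left out.

```python
import math
from collections import Counter


def hexamer_counts(seqs, step=3):
    """Count in-frame hexamers. A step of 3 (one codon) is an assumption."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # Stop at len - 5 so that every slice is a full 6 nucleotides.
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    return counts


def log_ratios(coding, noncoding, pseudo=1.0):
    """Log of (coding frequency / non-coding frequency) for each hexamer.

    The pseudocount (an assumption) avoids dividing by zero when a
    hexamer appears in only one of the two training sets.
    """
    hexamers = set(coding) | set(noncoding)
    c_total = sum(coding.values()) + pseudo * len(hexamers)
    n_total = sum(noncoding.values()) + pseudo * len(hexamers)
    return {
        h: math.log(((coding[h] + pseudo) / c_total)
                    / ((noncoding[h] + pseudo) / n_total))
        for h in hexamers
    }


def score(seq, ratios, step=3):
    """Sum the log ratios over the in-frame hexamers of one sequence.

    Hexamers never seen in training contribute 0 (no evidence either way).
    """
    seq = seq.upper()
    return sum(ratios.get(seq[i:i + 6], 0.0)
               for i in range(0, len(seq) - 5, step))


# Toy training data, made up for the example.
coding_counts = hexamer_counts(["ATGGCCATGGCC"])
intron_counts = hexamer_counts(["TTTTTTTTTTTT"])
ratios = log_ratios(coding_counts, intron_counts)

coding_like = score("ATGGCCATGGCC", ratios)
intron_like = score("TTTTTTTTT", ratios)
```

With this convention a positive score means the sequence looks more like the coding training set, and a negative score means it looks more like the intronic one. Because the hexamers are read in steps of 3 from the first position, the sketch only makes sense if every sequence is already in frame.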