Friday, 18th of December 2009 at 01:46:27 PM

Report2

Feb 27,2009

Objective: In this part, we compare three algorithms: ORF finder, Glimmer and ECgnfinder. We try to find the possible genes in H. erythrogramma and H. tuberculata, using the purple sea urchin as our model organism. Our second goal is to find the coverage for each of the two species.

About algorithms: ECgnfinder is based on the log-likelihood ratio between a codon usage model and a random model. Its output gives the log-likelihood ratio for each of the six reading frames. The first step is to train the algorithm with known gene sequences. The parameters of the model are the codon usage (expressed as probabilities) and the probabilities of each of the four nucleotides. The algorithm handles unidentified codons and always takes genes with proper start codons (ATG, GTG and TTG) and stop codons (TAG, TAA and TGA). Below is an excerpt of the ECgnfinder output.

ecgnfinder output screenshot
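
The scoring idea described above can be sketched in a few lines of Python. This is only an illustration and not the ECgnfinder source: the function names, the toy codon probabilities and the uniform base probabilities are all assumptions made for the example, and codons containing unidentified bases are simply skipped.

```python
import math

def reverse_complement(seq):
    """Reverse complement; any base other than A/C/G/T becomes N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(b, "N") for b in reversed(seq))

def frame_llr(seq, codon_prob, base_prob):
    """Sum of log(P(codon | codon usage model) / P(codon | random model))."""
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon not in codon_prob:
            continue  # unidentified codon (e.g. contains N): ignored here
        p_random = base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
        # Assumes every trained codon probability is strictly positive.
        score += math.log(codon_prob[codon] / p_random)
    return score

def six_frame_llr(seq, codon_prob, base_prob):
    """Log-likelihood ratio for the three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    scores = {}
    for offset in range(3):
        scores[f"+{offset + 1}"] = frame_llr(seq[offset:], codon_prob, base_prob)
        scores[f"-{offset + 1}"] = frame_llr(rc[offset:], codon_prob, base_prob)
    return scores

# Hypothetical toy parameters; real values would come from the training set.
toy_codon_prob = {"ATG": 1 / 32, "AAA": 1 / 32}
toy_base_prob = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_scores = six_frame_llr("ATGAAA", toy_codon_prob, toy_base_prob)
```

A positive score means the frame looks more like the trained codon usage than like random sequence; the frame with the highest score is the most gene-like under this simplified model.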

ORF finder compares the six reading frames based on ORF length. It also gives the start and end positions of each possible gene sequence.

ORF Finder screenshot
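
A length-based six-frame scan of this kind can be sketched as follows. This is not the ORF finder program itself; it is a minimal sketch that assumes the same start and stop codons listed for ECgnfinder, and the function names are hypothetical. Positions on the minus strand are reported relative to the reverse-complemented sequence.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: same set as listed for ECgnfinder
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(b, "N") for b in reversed(seq))

def orfs_in_frame(seq, offset):
    """Yield (start, end) pairs, 0-based and half-open, for ORFs in one frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i                 # first start codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3        # stop codon is included in the ORF
            start = None

def longest_orfs(seq):
    """Return (length, strand, frame, start, end) tuples, longest first."""
    seq = seq.upper()
    strands = {"+": seq, "-": reverse_complement(seq)}
    found = []
    for strand, s in strands.items():
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                found.append((end - start, strand, offset + 1, start, end))
    return sorted(found, reverse=True)

example_orfs = longest_orfs("CCATGAAATAGCC")
```

For the short example sequence, the only complete ORF is ATGAAATAG in the third forward frame.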

Glimmer also gives the most probable ORFs with their start and end positions.

Glimmer SCORE

Methodology

Here are the steps to create the training sets for ECgnfinder and Glimmer.

Flow Chart

Results:

Data analysis: We calculated the percentage of agreement and the percentage of disagreement among the algorithms based on the BLAST output.
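
The report does not spell out how agreement was computed, so the following is only one possible definition, shown as a sketch: each algorithm is represented by the set of contig IDs whose predicted ORF had a BLAST hit, and agreement between two algorithms is the shared fraction of their combined calls. The contig IDs and function names are hypothetical.

```python
from itertools import combinations

def percent_agreement(a, b):
    """Percentage of contigs called by both, out of contigs called by either."""
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)

def pairwise_agreement(calls):
    """Agreement for every pair of algorithms in a {name: set_of_ids} mapping."""
    return {
        (x, y): percent_agreement(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }

# Hypothetical contig IDs, for illustration only.
calls = {
    "ECgnfinder": {"c1", "c2", "c3"},
    "Glimmer": {"c2", "c3", "c4"},
    "ORF finder": {"c1", "c2", "c3", "c4"},
}
agreement = pairwise_agreement(calls)
disagreement = {pair: 100.0 - value for pair, value in agreement.items()}
```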

The ECgnfinder algorithm is based on the codon usage of each species. Below are the codon usage tables for H. erythrogramma and H. tuberculata:

Columns: codon, amino acid, frequency per 1000 codons, fraction among synonymous codons.

TTT Phe 18.51 0.45
TTC Phe 22.96 0.55
TTA Leu 11.87 0.13
TTG Leu 14.45 0.16
TCT Ser 14.60 0.21
TCC Ser 11.79 0.17
TCA Ser 15.85 0.22
TCG Ser 4.76 0.07
TAT Tyr 13.82 0.48
TAC Tyr 15.15 0.52
TAA Stp 1.64 0.21
TAG Stp 1.72 0.22
TGT Cys 11.56 0.56
TGC Cys 9.14 0.44
TGA Stp 4.29 0.56
TGG Trp 13.12 1.00
CTT Leu 17.26 0.19
CTC Leu 15.62 0.18
CTA Leu 11.63 0.13
CTG Leu 17.88 0.20
CCT Pro 13.27 0.29
CCC Pro 10.23 0.22
CCA Pro 17.33 0.38
CCG Pro 5.08 0.11
CAT His 15.85 0.59
CAC His 10.85 0.41
CAA Gln 17.26 0.44
CAG Gln 21.71 0.56
CGT Arg 6.01 0.11
CGC Arg 5.08 0.09
CGA Arg 5.31 0.10
CGG Arg 4.14 0.08
ATT Ile 16.94 0.32
ATC Ile 25.30 0.48
ATA Ile 10.23 0.19
ATG Met 30.14 1.00
ACT Thr 14.68 0.25
ACC Thr 16.01 0.27
ACA Thr 21.55 0.37
ACG Thr 6.56 0.11
AAT Asn 19.44 0.49
AAC Asn 20.30 0.51
AAA Lys 29.05 0.44
AAG Lys 36.78 0.56
AGT Ser 13.12 0.19
AGC Ser 10.46 0.15
AGA Arg 20.15 0.37
AGG Arg 13.90 0.25
GTT Val 16.40 0.24
GTC Val 18.04 0.27
GTA Val 15.93 0.24
GTG Val 16.94 0.25
GCT Ala 22.88 0.35
GCC Ala 15.30 0.23
GCA Ala 22.72 0.35
GCG Ala 4.92 0.07
GAT Asp 28.97 0.57
GAC Asp 22.25 0.43
GAA Glu 34.59 0.53
GAG Glu 30.92 0.47
GGT Gly 17.33 0.26
GGC Gly 11.79 0.18
GGA Gly 26.63 0.40
GGG Gly 10.07 0.15

Codon Usage Table (H. erythrogramma)

Columns: codon, amino acid, frequency per 1000 codons, fraction among synonymous codons.

TTT Phe 18.55 0.48
TTC Phe 20.11 0.52
TTA Leu 11.27 0.13
TTG Leu 14.73 0.17
TCT Ser 15.25 0.21
TCC Ser 10.05 0.14
TCA Ser 18.55 0.25
TCG Ser 3.29 0.04
TAT Tyr 15.25 0.52
TAC Tyr 14.04 0.48
TAA Stp 1.91 0.26
TAG Stp 1.73 0.24
TGT Cys 9.01 0.50
TGC Cys 8.84 0.50
TGA Stp 3.64 0.50
TGG Trp 9.88 1.00
CTT Leu 15.77 0.19
CTC Leu 14.56 0.17
CTA Leu 10.40 0.12
CTG Leu 18.03 0.21
CCT Pro 14.39 0.31
CCC Pro 9.88 0.22
CCA Pro 17.51 0.38
CCG Pro 4.16 0.09
CAT His 16.47 0.61
CAC His 10.57 0.39
CAA Gln 19.59 0.44
CAG Gln 24.79 0.56
CGT Arg 6.07 0.10
CGC Arg 3.81 0.06
CGA Arg 5.03 0.09
CGG Arg 4.33 0.07
ATT Ile 16.81 0.35
ATC Ile 19.93 0.41
ATA Ile 11.61 0.24
ATG Met 26.52 1.00
ACT Thr 12.13 0.23
ACC Thr 12.13 0.23
ACA Thr 24.96 0.46
ACG Thr 4.68 0.09
AAT Asn 21.67 0.48
AAC Asn 23.05 0.52
AAA Lys 33.28 0.43
AAG Lys 44.72 0.57
AGT Ser 13.52 0.18
AGC Ser 12.65 0.17
AGA Arg 25.48 0.43
AGG Arg 14.39 0.24
GTT Val 15.95 0.26
GTC Val 14.56 0.24
GTA Val 14.73 0.24
GTG Val 16.29 0.26
GCT Ala 21.32 0.38
GCC Ala 12.13 0.21
GCA Ala 18.89 0.33
GCG Ala 4.33 0.08
GAT Asp 31.72 0.58
GAC Asp 22.71 0.42
GAA Glu 36.40 0.51
GAG Glu 34.84 0.49
GGT Gly 17.16 0.26
GGC Gly 12.83 0.19
GGA Gly 23.57 0.35
GGG Gly 13.52 0.20

Codon Usage Table (H. tuberculata)
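
In both tables the third column sums to roughly 1000, so it reads as a frequency per thousand codons, and the fourth column appears to be each codon's share among the codons for the same amino acid. The sketch below recomputes that share from the per-thousand values for a few rows of the H. erythrogramma table; the function name is hypothetical.

```python
from collections import defaultdict

def synonymous_fractions(rows):
    """Map each codon to its share among codons encoding the same amino acid."""
    totals = defaultdict(float)
    for codon, amino_acid, per_thousand in rows:
        totals[amino_acid] += per_thousand
    return {
        codon: per_thousand / totals[amino_acid]
        for codon, amino_acid, per_thousand in rows
    }

# Three rows copied from the H. erythrogramma table above.
rows = [
    ("TTT", "Phe", 18.51),
    ("TTC", "Phe", 22.96),
    ("TGG", "Trp", 13.12),
]
fractions = synonymous_fractions(rows)
```

Rounded to two decimals, the results match the table: 0.45 for TTT, 0.55 for TTC and 1.00 for TGG.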

Agreement results:

H. erythrogramma:

Sample size: 537, mean: 125.86, standard deviation: 55.395

H. erythrogramma:

We examined the length distribution of the contigs, on the premise that a longer contig has a better chance of containing a gene.

Sample size: 1524, mean: 370.51, standard deviation: 198.913

We performed the coverage analysis in two ways.

  1. We used the Lander-Waterman model to estimate the coverage. We took the genome size of the sea urchin to be approximately 814 megabases (Science 10 November 2006 Vol 314 no 5801, pp 941-952).

Coverage = (average read length × number of reads) / genome size

For erythrogramma: 0.0065x

For tuberculata: 0.0063x

If we want 1x coverage for each species, then we will need approximately 1.3 million reads.
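
The coverage formula above, and its inversion to get the number of reads needed for a target coverage, can be written as a short sketch. The 814 megabase genome size is taken from the text; the read length and read count in the example are placeholders chosen only for illustration, since the report does not list them here.

```python
import math

GENOME_SIZE = 814_000_000  # sea urchin genome size used in the text, in bases

def coverage(mean_read_length, n_reads, genome_size=GENOME_SIZE):
    """Coverage = (average read length * number of reads) / genome size."""
    return mean_read_length * n_reads / genome_size

def reads_for_coverage(target, mean_read_length, genome_size=GENOME_SIZE):
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target * genome_size / mean_read_length)

# Placeholder inputs (assumptions, not values from the report).
example_coverage = coverage(mean_read_length=500, n_reads=10_000)
example_reads = reads_for_coverage(target=1.0, mean_read_length=500)
```

Because coverage is linear in the number of reads, going from a coverage of about 0.0065x to 1x requires roughly 1/0.0065, or about 150 times, as many reads of the same average length.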

  2. Another way to calculate the coverage is to use the Poisson distribution:

We take the parameter λ to be the average number of reads per mRNA, and y to be the number of times a particular read has been sequenced. From the first part of our project we know:

For erythrogramma: λ ≈ 3 (2.8), which means that on average 3 ESTs are assembled into each contig. If we want each read to be sequenced 4 times, then about 17% of ESTs will be sequenced 4 times into a contig. This number is 22% for tuberculata.
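
The erythrogramma figure can be checked with the Poisson probability mass function, P(Y = y) = e^(-λ) λ^y / y!. The sketch below evaluates it for y = 4 with the rounded value λ = 3; the function name is an assumption made for the example.

```python
import math

def poisson_pmf(y, lam):
    """Probability of observing exactly y events when the mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

p_four = poisson_pmf(4, 3)  # about 0.168, i.e. roughly 17%
```

Using the unrounded value λ = 2.8 instead gives a slightly lower probability of about 0.156.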

Contributions:

Training set generation, ECgnfinder:
Nathan Nehrt
Report writing, presentation, and data analysis:
Indrani Sarkar
Website page update, programming (BLAST running script, BLAST parser, ORF running script, ORF parser), result arrangement:
Chuan-Yih Yu
Glimmer, parsing its output, and analysis:
Sashikiran Challa
