Feb 27, 2009
Objective: In this part, we compare three algorithms: ORF Finder, Glimmer, and ECgnfinder. We are trying to find the possible genes in H. erythrogramma and H. tuberculata. We have used the purple sea urchin as our model organism. Our second goal is to find the coverage for each of the two species.
About algorithms: ECgnfinder is based on the log-likelihood ratio between a codon usage model and a random model. The output from ECgnfinder gives the log-likelihood ratio for each of the six reading frames. The first step is to train the algorithm with known gene sequences. The parameters of this model are the codon usage (expressed as probabilities) and the probabilities of each of the four nucleotides. The algorithm handles unidentified codons and always selects genes with proper start codons (ATG, GTG, and TTG) and stop codons (TAG, TAA, and TGA). Below is an excerpt of the ECgnfinder output.
ecgnfinder output screenshot
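The log-likelihood ratio idea described above can be illustrated with a minimal sketch. This is not the ECgnfinder source code; the function names `six_frames` and `frame_llr` and the toy probabilities in the usage note are assumptions made for illustration. The random model here is taken to be the product of the three single-nucleotide probabilities, and codons containing unidentified bases are simply skipped.

```python
import math

# Complement table for building the reverse strand (unknown bases such as N pass through).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(seq):
    """Yield (label, sequence) for the three forward and three reverse-complement frames."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    for strand, s in (("+", seq), ("-", rev)):
        for offset in range(3):
            yield f"{strand}{offset + 1}", s[offset:]

def frame_llr(frame_seq, codon_prob, base_prob):
    """Sum log(P_codon_model / P_random_model) over the complete codons of one frame."""
    total = 0.0
    for i in range(0, len(frame_seq) - 2, 3):
        codon = frame_seq[i:i + 3]
        # Skip unidentified codons and codons the trained model never saw.
        if codon_prob.get(codon, 0.0) <= 0.0:
            continue
        random_p = base_prob[codon[0]] * base_prob[codon[1]] * base_prob[codon[2]]
        total += math.log(codon_prob[codon] / random_p)
    return total
```

With a toy model such as `codon_prob = {"ATG": 0.5, "AAA": 0.5}` and uniform nucleotide probabilities of 0.25, calling `frame_llr` on each frame from `six_frames("ATGAAA")` gives a positive score only for the first forward frame. In practice the codon probabilities would come from the trained codon usage table.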
ORF Finder compares the six reading frames based on ORF length. It also gives the start and end positions of each possible gene sequence.
ORF Finder screenshot
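A length-based search of this kind can be sketched as follows. This is an illustration only, not the ORF Finder implementation: the helper names `orfs_in_strand` and `longest_orf` are hypothetical, only ATG is used as a start codon by default, and positions are 0-based, half-open indices on the strand where the ORF was found.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_strand(seq, starts=("ATG",)):
    """Yield (start, end) for each ORF, from its first start codon through its stop codon."""
    for offset in range(3):
        start = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # open a new ORF at the first start codon
            elif start is not None and codon in STOPS:
                yield start, i + 3             # close the ORF, stop codon included
                start = None

def longest_orf(seq):
    """Return (strand, start, end) of the longest ORF over all six frames, or None."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    best = None
    for strand, s in (("+", seq), ("-", rev)):
        for start, end in orfs_in_strand(s):
            if best is None or end - start > best[2] - best[1]:
                best = (strand, start, end)
    return best
```

For example, `longest_orf("CCATGAAATAGCC")` finds the ORF `ATGAAATAG` on the forward strand. ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.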
Glimmer also gives the most probable ORFs with their start and end positions.
Glimmer SCORE
Methodology
Here are the steps to create the training sets for ECgnfinder and Glimmer.

Flow Chart
Results:
Data analysis: We calculated the percentage of agreement and the percentage of disagreement among the algorithms based on the BLAST output.
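The exact formula used for the agreement percentages is not stated here. One common definition, shown below purely as an assumption, treats each algorithm's output as a set of contig identifiers with a BLAST-supported prediction and measures agreement as the shared fraction of all contigs called by either algorithm. The function name `percent_agreement` and the identifiers in the usage note are hypothetical.

```python
def percent_agreement(calls_a, calls_b):
    """Return (agreement %, disagreement %) over contigs called by either algorithm."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0, 0.0                      # nothing was called by either algorithm
    agree = 100.0 * len(a & b) / len(union)  # shared calls as a share of all calls
    return agree, 100.0 - agree
```

For instance, `percent_agreement({"c1", "c2", "c3"}, {"c2", "c3", "c4"})` counts two shared contigs out of four in total.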
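The ECgnfinder algorithm is based on the codon usage of each species. Below are the codon usage tables for H. erythrogramma and H. tuberculata. Each entry lists the codon, the amino acid (Stp marks a stop codon), the frequency per thousand codons, and the fraction of that codon among the codons for the same amino acid: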
| TTT Phe 18.51 0.45 | TTC Phe 22.96 0.55 | TTA Leu 11.87 0.13 | TTG Leu 14.45 0.16 |
| TCT Ser 14.60 0.21 | TCC Ser 11.79 0.17 | TCA Ser 15.85 0.22 | TCG Ser 4.76 0.07 |
| TAT Tyr 13.82 0.48 | TAC Tyr 15.15 0.52 | TAA Stp 1.64 0.21 | TAG Stp 1.72 0.22 |
| TGT Cys 11.56 0.56 | TGC Cys 9.14 0.44 | TGA Stp 4.29 0.56 | TGG Trp 13.12 1.00 |
| CTT Leu 17.26 0.19 | CTC Leu 15.62 0.18 | CTA Leu 11.63 0.13 | CTG Leu 17.88 0.20 |
| CCT Pro 13.27 0.29 | CCC Pro 10.23 0.22 | CCA Pro 17.33 0.38 | CCG Pro 5.08 0.11 |
| CAT His 15.85 0.59 | CAC His 10.85 0.41 | CAA Gln 17.26 0.44 | CAG Gln 21.71 0.56 |
| CGT Arg 6.01 0.11 | CGC Arg 5.08 0.09 | CGA Arg 5.31 0.10 | CGG Arg 4.14 0.08 |
| ATT Ile 16.94 0.32 | ATC Ile 25.30 0.48 | ATA Ile 10.23 0.19 | ATG Met 30.14 1.00 |
| ACT Thr 14.68 0.25 | ACC Thr 16.01 0.27 | ACA Thr 21.55 0.37 | ACG Thr 6.56 0.11 |
| AAT Asn 19.44 0.49 | AAC Asn 20.30 0.51 | AAA Lys 29.05 0.44 | AAG Lys 36.78 0.56 |
| AGT Ser 13.12 0.19 | AGC Ser 10.46 0.15 | AGA Arg 20.15 0.37 | AGG Arg 13.90 0.25 |
| GTT Val 16.40 0.24 | GTC Val 18.04 0.27 | GTA Val 15.93 0.24 | GTG Val 16.94 0.25 |
| GCT Ala 22.88 0.35 | GCC Ala 15.30 0.23 | GCA Ala 22.72 0.35 | GCG Ala 4.92 0.07 |
| GAT Asp 28.97 0.57 | GAC Asp 22.25 0.43 | GAA Glu 34.59 0.53 | GAG Glu 30.92 0.47 |
| GGT Gly 17.33 0.26 | GGC Gly 11.79 0.18 | GGA Gly 26.63 0.40 | GGG Gly 10.07 0.15 |
Codon Usage Table (H. erythrogramma)
| TTT Phe 18.55 0.48 | TTC Phe 20.11 0.52 | TTA Leu 11.27 0.13 | TTG Leu 14.73 0.17 |
| TCT Ser 15.25 0.21 | TCC Ser 10.05 0.14 | TCA Ser 18.55 0.25 | TCG Ser 3.29 0.04 |
| TAT Tyr 15.25 0.52 | TAC Tyr 14.04 0.48 | TAA Stp 1.91 0.26 | TAG Stp 1.73 0.24 |
| TGT Cys 9.01 0.50 | TGC Cys 8.84 0.50 | TGA Stp 3.64 0.50 | TGG Trp 9.88 1.00 |
| CTT Leu 15.77 0.19 | CTC Leu 14.56 0.17 | CTA Leu 10.40 0.12 | CTG Leu 18.03 0.21 |
| CCT Pro 14.39 0.31 | CCC Pro 9.88 0.22 | CCA Pro 17.51 0.38 | CCG Pro 4.16 0.09 |
| CAT His 16.47 0.61 | CAC His 10.57 0.39 | CAA Gln 19.59 0.44 | CAG Gln 24.79 0.56 |
| CGT Arg 6.07 0.10 | CGC Arg 3.81 0.06 | CGA Arg 5.03 0.09 | CGG Arg 4.33 0.07 |
| ATT Ile 16.81 0.35 | ATC Ile 19.93 0.41 | ATA Ile 11.61 0.24 | ATG Met 26.52 1.00 |
| ACT Thr 12.13 0.23 | ACC Thr 12.13 0.23 | ACA Thr 24.96 0.46 | ACG Thr 4.68 0.09 |
| AAT Asn 21.67 0.48 | AAC Asn 23.05 0.52 | AAA Lys 33.28 0.43 | AAG Lys 44.72 0.57 |
| AGT Ser 13.52 0.18 | AGC Ser 12.65 0.17 | AGA Arg 25.48 0.43 | AGG Arg 14.39 0.24 |
| GTT Val 15.95 0.26 | GTC Val 14.56 0.24 | GTA Val 14.73 0.24 | GTG Val 16.29 0.26 |
| GCT Ala 21.32 0.38 | GCC Ala 12.13 0.21 | GCA Ala 18.89 0.33 | GCG Ala 4.33 0.08 |
| GAT Asp 31.72 0.58 | GAC Asp 22.71 0.42 | GAA Glu 36.40 0.51 | GAG Glu 34.84 0.49 |
| GGT Gly 17.16 0.26 | GGC Gly 12.83 0.19 | GGA Gly 23.57 0.35 | GGG Gly 13.52 0.20 |
Codon Usage Table (H. tuberculata)
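Tables of this shape can be derived from a set of in-frame coding sequences. The sketch below is one way to do it and is not the tool that produced the tables above; the function name `codon_usage` is hypothetical, the standard genetic code is assumed, and amino acids are reported as one-letter codes with `*` for stop codons rather than the three-letter labels used in the tables.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(coding_seqs):
    """Return {codon: (amino acid, frequency per thousand, fraction among synonymous codons)}."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in GENETIC_CODE:          # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[GENETIC_CODE[codon]] += n
    # Only amino acids that were observed at least once are reported.
    return {
        codon: (aa, 1000.0 * counts[codon] / total, counts[codon] / per_aa[aa])
        for codon, aa in GENETIC_CODE.items()
        if per_aa[aa]
    }
```

As a check on the reading of the columns, the per-thousand values in each table above sum to roughly 1000, and the fractions for each amino acid sum to roughly 1.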




Agreement results:
H. erythrogramma:

| Sample size | 537 | Mean | 125.86 |
| Standard deviation | 55.395 | | |
H. erythrogramma:
We examined the length distribution of the contigs, on the assumption that a longer contig has a better chance of containing a gene.

| Sample size | 1524 | Mean | 370.51 |
| Standard deviation | 198.913 | | |
We did the coverage analysis in two ways.
- 1. We used the Lander-Waterman model to estimate the coverage. We took the genome size of the sea urchin to be approximately 814 megabases (Science 10 November 2006, Vol. 314, no. 5801, pp. 941-952).
Coverage = (average read length × number of reads) / genome size
For H. erythrogramma: 0.0065x
For H. tuberculata: 0.0063x
If we want 1x coverage for each species, we will need approximately 1.3 million reads.
- 2. Another way to calculate the coverage is to use the Poisson distribution:
We can take the parameter λ as the average number of reads per mRNA and y as the number of times a particular read has been sequenced. From the first part of our project we know:
For H. erythrogramma: λ ≈ 3 (2.8), which means that on average about 3 ESTs are assembled into each contig. If we want each read to be sequenced 4 times, then about 17% of ESTs will be sequenced 4 times into a contig. This number will be 22% for H. tuberculata.
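Both calculations above are short enough to sketch directly. The function names are hypothetical, and the read length used in the usage note is an illustrative assumption, not a value from this project. The Poisson probability is P(Y = y) = λ^y · e^(−λ) / y!.

```python
import math

def lander_waterman_coverage(avg_read_len, n_reads, genome_size):
    """Coverage = (average read length * number of reads) / genome size."""
    return avg_read_len * n_reads / genome_size

def reads_for_coverage(target, avg_read_len, genome_size):
    """Number of reads needed to reach a target coverage, rounded up."""
    return math.ceil(target * genome_size / avg_read_len)

def poisson_pmf(y, lam):
    """Probability that a read is sequenced exactly y times when the mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)
```

With λ = 3 and y = 4, `poisson_pmf(4, 3.0)` is about 0.168, which is consistent with the 17% figure quoted for H. erythrogramma. A call such as `reads_for_coverage(1.0, 600, 814_000_000)` uses an assumed 600-base average read length together with the 814-megabase genome size.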
References:
- Sea urchin Forkhead gene family: Phylogeny and embryonic expression
- The Genome of the Sea Urchin Strongylocentrotus purpuratus
- [Genome-announce] Purple sea urchin assembly in Genome Browser
- Lander-Waterman Statistics for Shotgun Sequencing
- Lander-Waterman Model
Contributions:
| Generate training set, ECgnfinder | Nathan Nehrt |
| Report writing, presentation, and data analysis | Indrani Sarkar |
| Website page update, programming (BLAST running script, BLAST parser, ORF running script, ORF parser), result arrangement | Chuan-Yih Yu |
| Glimmer, parsing its output, and analysis | Sashikiran Challa |








