Barcodes for genomes and applications
Fengfeng Zhou, et al. BMC Bioinformatics
Whether each genome carries a unique signature that distinguishes it from all other genomes remains an open question. This paper analyzes the 586 prokaryotic genomes sequenced at the time by calculating k-mer frequency distributions for 1<k<6. The authors divide each genome into equal-length fragments, calculate the k-mer frequencies of each fragment, and then convert the abundance of each k-mer into a gray level to form the barcode.
They report four main observations. First, the vertical bands in the barcode show that 4-mer frequencies are quite stable across a genome. Second, a small fraction of each genome deviates from the overall pattern, and these deviating regions typically correspond to special classes of genes. Third, different chromosomes within a species share most of their pattern and differ only in a small part. Finally, barcode similarity can be used directly to infer phylogenetic distance.
They use Markov chains as models for generating random nucleotide sequences, and chains of third order and above show patterns similar to the real data. Coding and non-coding regions can also be told apart by their different barcode patterns.
They also apply the method to eukaryotic, mitochondrial, plastid and plasmid genomes. In higher eukaryotic genomes, different regions share a similar backbone barcode, and introns and promoter regions have lower complexity than repetitive and coding regions. The mitochondrial and plastid genomes do not show patterns similar to those of the other genomes. From these analyses they conclude that different classes of genomes have their own unique characteristics.
In the prokaryotic genome analysis, the abnormal regions tend to be horizontal gene transfers, phage invasions and highly expressed genes. About 30% of the abnormal sequences fall into these three categories; the remaining 70% are still unexplained. The authors also use barcode similarity to sort sequence fragments into different bins.
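To make the barcode construction described above concrete, here is a minimal sketch in Python. It is not the authors' software: the function name `kmer_barcode`, the default fragment length, the 256 gray levels, and the choice to scale against the largest frequency in the whole genome are all assumptions made for illustration.

```python
from itertools import product


def kmer_barcode(sequence, k=4, fragment_len=1000, levels=256):
    """Return (kmers, barcode); barcode has one row per fragment and
    one gray level per k-mer. All parameter defaults are assumptions."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    sequence = sequence.upper()

    rows = []
    # Walk over non-overlapping, equal-length fragments; a short tail is dropped.
    for start in range(0, len(sequence) - fragment_len + 1, fragment_len):
        fragment = sequence[start:start + fragment_len]
        counts = [0] * len(kmers)
        for i in range(len(fragment) - k + 1):
            j = index.get(fragment[i:i + k])  # None for words containing N etc.
            if j is not None:
                counts[j] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])

    # Assumed scaling: map frequencies linearly onto 0..levels-1,
    # relative to the largest frequency seen anywhere in the genome.
    peak = max((f for row in rows for f in row), default=0.0)
    if peak == 0.0:
        return kmers, [[0] * len(kmers) for _ in rows]
    barcode = [[round(f / peak * (levels - 1)) for f in row] for row in rows]
    return kmers, barcode
```

Drawing each row as a horizontal strip of gray pixels and stacking the rows gives an image in which a k-mer with stable frequency along the genome shows up as a vertical band.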
For the test they use 11 randomly selected genomes from different species of the same genus, as well as two further sets of genomes from 30 and 100 randomly selected bacterial genera. They use fragment sizes of 500, 1000, 2000, 5000 and 10000 bp to find the limits of the software's accuracy. The results show that fragments longer than 1000 bp reach more than 50% accuracy in all three sample sets. The barcode pattern appears only in genome sequences, not in random sequences. The barcode method provides a global view of a genome, and its presentation makes it easy to compare distributions between genomes.
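The idea of sorting fragments into bins by similarity can be sketched as a nearest-profile assignment. This summary does not say which similarity measure the paper uses, so the Euclidean distance below is a stand-in, and the names `kmer_profile` and `assign_bin` are hypothetical.

```python
from itertools import product
import math


def kmer_profile(sequence, k=4):
    """Relative frequency of every k-mer over ACGT, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # skip words with ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def assign_bin(fragment, reference_profiles, k=4):
    """Return the name of the reference whose profile is closest to the
    fragment's profile. Euclidean distance is an assumed measure."""
    profile = kmer_profile(fragment, k)
    return min(
        reference_profiles,
        key=lambda name: math.dist(profile, reference_profiles[name]),
    )
```

Here `reference_profiles` is a dictionary that maps a genome name to its `kmer_profile`, built with the same `k` as the fragments. Shorter fragments yield noisier profiles, which fits the drop in accuracy reported for small fragment sizes.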
This research applies mainly to prokaryotes, and its conclusions are drawn from the test sets. We therefore cannot confidently extend them to other genomes. The software's accuracy limit is around 1000 bp, whereas NST reads are only around 400 bp each. The barcode scheme therefore does not seem suitable for short reads.





