Tuesday, 16th of February 2010 at 12:15:51 PM

ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences

Yijun Sun, et al. Nucleic Acids Research

This paper proposes a new method for assigning a large sample of sequences to operational taxonomic units (OTUs). The goal is a tool that is fast, accurate, and able to handle large-scale data, so that metagenomics researchers can estimate species richness. The authors first compare two alignment approaches: multiple sequence alignment (MSA), which was commonly used in previous studies, and pairwise sequence alignment (PSA), and show that the two give comparable results. They claim that PSA is both faster to compute and more accurate than MSA. A further advantage of PSA is that the problem can be divided into multiple subsets and computed in parallel. The full ESPRIT strategy has four steps: removing low-quality reads, computing pairwise distances, assigning sequences to OTUs, and statistical inference of species richness. First, the program removes reads that fail any of several criteria: reads containing ambiguous nucleotides, reads with more than one mismatch at the beginning, and reads of atypical length. This shrinks the problem and reduces the computational cost. The Needleman-Wunsch algorithm is used for the pairwise alignments. Only pairwise distances < 0.1 are kept and the rest are discarded, which speeds up processing and saves storage space. A k-mer score is also calculated for each pair of sequences, with its own threshold (default 0.5). Hcluster is introduced to assign sequences to OTUs. This new algorithm can process the distance information on the fly. It labels each sequence as either active or inactive: a sequence is active if there is not yet enough distance information to cluster it, and inactive if it has no distance information or has already been clustered. Hcluster is a general-purpose clustering method that can be applied to other clustering problems, not only this one.
They compare ESPRIT with DOTUR and MOTHUR, which have been commonly used in metagenomics projects for several years. The results show that DOTUR and MOTHUR overestimate species richness. Next-generation sequencing can produce huge numbers of sequences at a lower price than previous methods, and ESPRIT gives us a better way to study microorganisms.
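
To make the distance step in the summary above more concrete, here is a small Python sketch of how a k-mer filter and Needleman-Wunsch alignment could fit together. It is not ESPRIT's code. The function names, the k-mer size of 6, the scoring scheme (match +1, mismatch -1, gap -1), the distance definition (differing columns divided by alignment length), and the use of the k-mer score as a cheap pre-filter before alignment are all my own assumptions for illustration. Only the two thresholds (0.5 and 0.1) come from the summary.

```python
from collections import Counter
from itertools import combinations


def kmer_distance(a, b, k=6):
    """Alignment-free distance: 1 - shared k-mers / k-mers in the shorter read.

    Assumes both reads are at least k long. k=6 is an assumed value.
    """
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum(min(n, cb[kmer]) for kmer, n in ca.items())
    return 1.0 - shared / (min(len(a), len(b)) - k + 1)


def nw_distance(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment, then differing columns / all columns.

    The scoring values are assumptions, not the paper's settings.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace one optimal path back and count columns that are not matches.
    i, j, columns, differences = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            differences += a[i - 1] != b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            differences += 1  # gap in b
            i -= 1
        else:
            differences += 1  # gap in a
            j -= 1
    return differences / columns if columns else 0.0


def pairwise_distances(reads, kmer_cutoff=0.5, dist_cutoff=0.1, k=6):
    """Keep only close pairs, as a sparse {(i, j): distance} dictionary."""
    kept = {}
    for (i, a), (j, b) in combinations(enumerate(reads), 2):
        if kmer_distance(a, b, k) > kmer_cutoff:
            continue  # too different: skip the expensive alignment
        d = nw_distance(a, b)
        if d < dist_cutoff:
            kept[(i, j)] = d
    return kept


reads = ["ACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGA",
         "TTTTTGGGGGCCCCCAAAAA"]
print(pairwise_distances(reads))
```

With these three toy reads, only the first two are close enough to be kept. The third shares no 6-mers with the others, so it is never aligned at all. Each pair is handled independently, which is what makes the pairwise approach easy to split across machines.
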

I think the major problems in metagenomics are how to process huge amounts of data efficiently and how to do data mining. This method gave me a hint that we don’t need to improve every step; sometimes replacing a step outright gives a surprisingly good result.

Paper Link
