CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads
Sourav Chatterji, et al. RECOMB.
Studying microbial diversity faces a major obstacle: most microbes cannot be cultured in the lab, so the traditional culture-based procedures cannot be used to analyze them. Next-generation sequencing and shotgun sequencing offer a way around this, but they raise problems of their own. Short reads from all the species in a sample are sequenced together, so they must be classified before any further analysis. Read binning, deciding which read belongs to which genome, is therefore the first problem to solve. Once the reads are classified accurately, we can estimate the richness of the community.
CompostBin is one of the tools built for this task. It represents each read by a feature vector of k-mer frequencies. The paper uses 6-mers, so each sequence has 4^6 = 4096 features (one per 6-mer over the 4 DNA nucleotides), and principal component analysis (PCA) projects these vectors onto the first three principal components. A 6-nearest-neighbor graph is then constructed for clustering: each vertex represents a sequence, and an edge exists between two sequences if one of them is among the 6 nearest neighbors of the other. Each edge is weighted by the exponential of the negative normalized Euclidean distance between its endpoints, so closer sequences get heavier edges. By incorporating phylogenetic markers into the graph, CompostBin computes a normalized cut score and splits the reads into two bins. This bisection is repeated until the number of bins set by the user is reached.
The authors test the method on simulated datasets, generating simulated Sanger reads from known genomes with ReadSim. The datasets are designed to differ in number of species, relative abundance, phylogenetic diversity, and GC content, in order to test the accuracy of the program. For 10 of the 12 simulated datasets the binning error is below 6%. The other two have higher error rates because the species in them are separated by a small phylogenetic distance. The authors also apply CompostBin to real metagenomic data, the glassy-winged sharpshooter endosymbionts, where the result shows a 5.9% error compared with the original paper.
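To make the pipeline above more concrete, here is a minimal Python sketch of three of its pieces: k-mer frequency vectors, a symmetric nearest-neighbor graph with exponentially decaying edge weights, and the normalized cut score of a two-way split. It is not the authors' implementation. The function names are hypothetical, the PCA projection and the phylogenetic-marker step are left out, and dividing distances by the largest pairwise distance is only one assumed way to "normalize" them.

```python
import math
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized frequency of every DNA k-mer in seq (4**k values)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def knn_graph(points, n_neighbors=6):
    """Symmetric k-NN graph: edge (u, v) exists if either vertex is among the
    other's nearest neighbors. Weight = exp(-d / max_d) (assumed normalization)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    max_d = max(max(row) for row in dist) or 1.0
    weights = {}
    for u in range(n):
        nearest = sorted((v for v in range(n) if v != u), key=lambda v: dist[u][v])
        for v in nearest[:n_neighbors]:
            weights[(min(u, v), max(u, v))] = math.exp(-dist[u][v] / max_d)
    return weights


def normalized_cut(weights, part_a):
    """Normalized cut of splitting the vertices into part_a and the rest:
    cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    cut = assoc_a = assoc_b = 0.0
    for (u, v), w in weights.items():
        if (u in part_a) != (v in part_a):
            cut += w
        # Each endpoint adds the edge weight to the association of its own side.
        for vertex in (u, v):
            if vertex in part_a:
                assoc_a += w
            else:
                assoc_b += w
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # degenerate split: one side has no edges
    return cut / assoc_a + cut / assoc_b


# Toy usage with 2-mers: two AT-rich and two GC-rich made-up reads.
reads = ["ATATATTTAAATATTA", "TTATAATATTTAATAT",
         "GCGCGGCCGCGGGCCG", "CCGCGGCGCGCCGGGC"]
features = [kmer_frequencies(r, k=2) for r in reads]
graph = knn_graph(features, n_neighbors=2)
print(normalized_cut(graph, {0, 1}), normalized_cut(graph, {0, 2}))
```

A lower normalized cut score marks a better split, so a bisection step would search for the partition that minimizes it. In this toy setting one would expect the split that separates the AT-rich reads from the GC-rich reads to score lower than one that mixes them.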
The reported results show good performance. The paper notes that the program is still under development: the authors want to try different clustering methods, since PCA cannot capture nonlinear structure. They claim that the first three principal components are enough for clustering, but this rests on simulated results. Are three components really enough for large communities? The choice of 6-mers is another important parameter: could k be increased or decreased for different purposes, for example a larger k to capture more detailed characteristics, or a smaller k to reduce running time? Although the program does not require a training set, its parameters are chosen under very strong assumptions.
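The question about k can be made concrete with a little arithmetic. The number of features grows as 4^k, while a read of length L only contains L - k + 1 overlapping k-mers, so the feature vector becomes sparse quickly as k increases. The sketch below tabulates this; the function name is hypothetical and the read length of 1000 bases is an assumed, roughly Sanger-like value, not a figure from the paper.

```python
def kmer_tradeoff(read_length, k):
    """Return (number of features, k-mer windows per read, average count per feature)."""
    n_features = 4 ** k                       # one feature per possible DNA k-mer
    n_windows = max(read_length - k + 1, 0)   # overlapping k-mers in one read
    return n_features, n_windows, n_windows / n_features


# Assumed read length of 1000 bases, for illustration only.
for k in range(2, 9):
    n_features, n_windows, avg = kmer_tradeoff(1000, k)
    print(f"k={k}: {n_features} features, {n_windows} windows, {avg:.3f} per feature")
```

Under this assumption a 6-mer vector already has about four times as many features as a single read has k-mers, which suggests that a larger k would add detail mostly in the form of zeros, while a smaller k would cut the dimension and running time at the cost of resolution.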





