Assembly and binning for metagenomic samples with high intra-population diversity
Binning is to cluster similar contigs assembled from reads based on the coverage and sequences composition (e.g., kmer composition) of contigs. The resulting bin, which we also called metagenomic assembled genome (MAG), as a representative of the true collection of species population genome.
- Assembly and Binning for high inter-population diversity samples. For many environmental samples, especially environments that are highly heterogeneous, e.g., soil, genome coverage is a big problem because the number of sequences that we got from sequencing for each species population is small when there are so many species (and they are not closely related species) and the sequencing efforts is fixed. This can be easily derived from the Lander-Waterman equation (1). Specifically, we can use the same sequencing efforts per sample (e.g., #N of sequences) to infer the overall coverage of samples from different environments, as a indicator of sample inter-population diversity. Therefore, assembling for those low coverage population genomes will be less efficient. We will use soil or activated sludge as examples below, which are well known for high inter-population diversity and also ocean samples, which are known for low inter-population diversity.
Written on December 20, 2023