The first two bacterial genomes, including that of Haemophilus influenzae, were sequenced almost 30 years ago. Sequencing the Haemophilus influenzae genome cost over $1,000,000 and took several years. Today, an entire bacterial genome can be sequenced in hours for less than $100, and over 1.9 million bacterial genomes are available in databases such as AllTheBacteria, along with large amounts of information in multi-omic datasets.
Many bacterial genome sequences are publicly available. Most were generated by short-read sequencing, which often yields fragmented assemblies rather than complete genomes. Long-read sequencing can produce complete genomes. It has shown that genome structure, meaning the order and orientation of genes on the chromosome, varies considerably in many species. There is increasing evidence that genome structure is important for gene expression and thus for the characteristics of the bacteria.
Sequencing bacterial genomes has become routine practice in numerous laboratories. By December 2024, GenBank had accumulated nearly 2.4 million bacterial genome sequences. The majority of these were produced by short-read sequencing, particularly on the Illumina platform. This method generates a vast number of short reads, which are then assembled to cover most of the target genome. However, many genomes contain repeat sequences, such as insertion sequence (IS) elements and ribosomal RNA operons, that are longer than the reads. Because no read spans such a repeat, assembly methods cannot determine which of the sequences flanking one copy of the repeat are connected. Consequently, genomes produced by short-read methods are typically drafts composed of multiple contigs rather than a complete end-to-end sequence. As a result, the genome's structure, that is, the order and orientation of the contigs, remains undetermined.
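The effect of repeats on short-read assembly can be illustrated with a minimal sketch. It assumes idealized conditions that do not hold for real data: error-free single-end reads, every possible read sampled, and a linear toy genome. The sequences and the names `reads` and `distinguishable` are hypothetical and chosen only for illustration. Two genomes that differ only in the order of the segments between repeat copies yield exactly the same reads when the reads are no longer than the repeat, so no assembler could tell them apart; reads that span the repeat plus a flanking base on each side resolve the order.

```python
from collections import Counter

def reads(genome: str, length: int) -> Counter:
    """Multiset of all error-free reads (substrings) of the given length."""
    return Counter(genome[i:i + length] for i in range(len(genome) - length + 1))

# Hypothetical toy sequences; REPEAT stands in for an IS element or rRNA operon.
REPEAT = "GATTACAGGCTA"
A, B, C, D = "TTGCT", "ACCGTA", "CGGTC", "GTTAA"

# Same content, but the unique segments B and C are swapped between repeat copies.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D

def distinguishable(read_length: int) -> bool:
    """True if the two genomes produce different read multisets."""
    return reads(genome_1, read_length) != reads(genome_2, read_length)

# Reads as long as the repeat never connect both flanks of a repeat copy.
print(distinguishable(len(REPEAT)))      # False

# Reads covering the repeat plus one base on each side reveal the order.
print(distinguishable(len(REPEAT) + 2))  # True
```

In this toy case an assembler given the shorter reads could at best report the segments as separate contigs with unknown order, which mirrors the draft genomes described above.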
Of the 211,705 sequenced bacterial genomes, most represent human pathogens from the phylum Firmicutes and the class Gammaproteobacteria.
Fewer than 10% of the publicly available sequenced bacterial genomes in the National Center for Biotechnology Information (NCBI) database belong to the phylum Actinomycetota, a prolific producer of natural products (NPs); 19,177 of its genomes have been sequenced.
As of May 2019, public sequencing data in the NCBI database exist for more than 211,000 bacteria, providing rich genomic diversity.
Several thousand more genomes are represented in metagenomics datasets.
see also:
Microbial Genes
Mycoplasma genitalium