Skip to content

qualibact v1.0 methods

General Strengths

  • The pipeline is fully automated, generic, and can be applied to any set of genomes — including arbitrary subsets such as species, clonal complexes, or lineages.

  • Quality assessment is based on multiple standard metrics (e.g. N50, number of contigs, genome size, GC%), allowing reproducible filtering.

  • Species-specific thresholds can be derived from available reference genomes, and thresholds can be updated as more genomes are added.

  • Variation between species — even within a genus — supports the need for species-level cutoffs, which this approach accommodates.

  • Variation between SRA and Refseq: We have observed that Genome size and assembly length distributions differ significantly between RefSeq and SRA (i.e. ATB). The cause is unclear, but relying on RefSeq-derived thresholds alone may result in unfairly excluding valid genomes. This approach combines both datasets to ensure a more inclusive and representative set of thresholds.

Caveats

  • Species Definitions Depend on GTDB: This tools uses Sylph for species designation, so all GTDB-related quirks apply. E.g., Shigella spp. is included in E. coli, and there are issues separating Burkholderia mallei from Burkholderia pseudomallei and Bordetella pertussis/Bordetella parapertussis from Bordetella bronchiseptica.
  • No Ground Truth Claims: This evaluation reflects what has been previously observed in available datasets. It does not attempt to define a universal "ground truth" for any species.
  • Assembly-Method Specific: The metrics (e.g. N50, number of contigs) are meaningful primarily for assemblies generated with Shovill (or similar SPAdes-based pipelines). Exact thresholds will vary for long-read or alternative assemblers like SKESA. However, not using Shovill implies rejection of the Torstyverse, which is heresy.
  • Long-Read Assemblies Not Explicitly Handled: These cutoffs are not designed for long-read assemblies. That said, genome size and GC content thresholds should still apply, and it's reasonable to expect long-read assemblies to exceed the quality of short-read derived thresholds — not fall below them.
  • Generic vs. Specific Tradeoff: While the generic approach is broadly applicable, it may miss species-specific quality nuances or lineage-level exceptions.

Citation

If you use QualiBact, please cite the following:

Alikhan, NF. Species specific quality control of bacterial de novo genome assemblies using QualiBact. Available at: https://github.com/cgps-group/qualibact (Accessed: [insert date]).

Datasets

Criteria cutoffs were determined using a combination of standard statistical methods and machine learning techniques to identify outliers and establish robust QC thresholds. Two datasets were used for this purpose:

  1. Allthebacteria dataset: The 2024-08 release comprises 2.4 million uniformly reprocessed genome assemblies, including taxonomic estimates aligned to the GTDB phylogeny. Taxonomic classification was based on sylph (v0.5.1) results with the GTDB r214 database. For more information on how sylph was run, see the Allthebacteria documentation. Only species with more than 1,000 genome records (as defined by sylph) were included. For more information about sylph visit sylph documentation.

  2. NCBI RefSeq complete genomes: Complete genomes from RefSeq were used to compare metrics such as genome size, number of coding sequences, and GC content. Metadata was downloaded using the NCBI Datasets command-line tool, filtered for completeness, cached as JSON, and parsed for analysis.

Quality control metrics

Pre-calculated "assembly stats" (N50, longest contig, total assembly length, number of contigs, and longest contig size) and CheckM results (contamination and completeness) were used. The GC content was recalculated to obtain data with at least two decimals (higher precision).

Outlier Detection and Filtering

To identify and remove potential outliers prior to downstream analyses, an unsupervised machine learning approach using the Isolation Forest algorithm was implemented. This method is well-suited for high-dimensional biological data and does not require labeled training examples.

The outlier filter was applied to features representing key genome quality and assembly metrics: total_length, GC_Content, N50, number (contigs), longest (contig length), Completeness_Specific, and Contamination. The scikit-learn (v1.6.1) IsolationForest implementation was used with a fixed random seed (random_state=42) for reproducibility. Each sample received an anomaly score, and was classified as either inlier (anomaly = 1) or outlier (anomaly = -1). Outlier-filtered datasets were retained in parallel with unfiltered data for comparative analyses. Source code for this process is available at QualiBact.

Statistical Analysis and Threshold Selection

Summary statistics (mean, standard deviation, quartiles, interquartile range) were calculated for genome quality metrics. Normality was assessed using the D’Agostino–Pearson test. For metrics without RefSeq values (e.g., N50, number of contigs), lower and upper bounds were set by applying Isolation Forest filtering, then selecting the 0.5th and 99.5th percentiles of the filtered data. For genome size and GC content, distributions were compared to RefSeq using the Kolmogorov–Smirnov test and Wasserstein distance. Final bounds were set as the most extreme values between RefSeq-derived limits and the 0.5th–99.5th percentile range of SRA genomes after filtering. Histograms were generated to visualize comparative distributions.

PASS / WARN / FAIL verdict zones

For each metric on a species page, the website renders a four-bound view:

Column What it is How it is computed
Fail below Below this value the assembly is unusual enough to fail. FINAL_LOWER — the 0.5th percentile of the post-Isolation-Forest pool (or the more extreme of that and the RefSeq-derived lower for metrics with RefSeq calibration).
Warn below Below this value but above Fail-below is borderline. WARN_LOWER — the 2.5th percentile of the same pool.
Warn above Above this value but below Fail-above is borderline. WARN_UPPER — the 97.5th percentile of the same pool.
Fail above Above this value the assembly fails. FINAL_UPPER — the 99.5th percentile (or the more extreme of that and the RefSeq-derived upper).

Some metrics are bounded only on one side — N50 has only a lower bound (a very high N50 is never a failure), Contamination has only an upper bound, Completeness_Specific has only a lower bound.

Percentile bands

The FINAL and WARN bounds are not derived from a per-metric ± buffer; both come directly from percentiles of the species pool:

Band Lower percentile Upper percentile Meaning
FINAL (= FAIL outside) 0.5 99.5 Below FINAL_lower or above FINAL_upper → assembly fails
WARN 2.5 97.5 Between WARN and FINAL → unusual but not failing

Both percentiles are computed on the post-IsolationForest species pool (i.e. after IF outlier removal on the HQ-filtered samples). For Genome_Size and GC_Content, where a RefSeq reference set is available, the published FINAL bounds take the more extreme of (RefSeq-derived limit) and (SRA-pool 0.5 / 99.5 percentile).

v1.0 itself ships no overrides — every bound is the raw engine percentile output. Expert-requested pins live in v1.1; see the v1.1 methods page for the hand-pinned threshold rationale table.