The reference dataset for this species has substantial quality issues. Thresholds should be treated as indicative only.
The thresholds were derived from 12 genomes: 6 from RefSeq and 6 from other sources. For the derivation pipeline and the PASS / WARN / FAIL verdict model, see the methods page for REFSEQ-QC-v1.
Applied to the full AllTheBacteria dataset, these thresholds place 0 genomes at PASS, 0 at WARN, and 42 at FAIL (42 assessed in total). The per-tier genome lists can be downloaded below in .csv.gz format; the FAIL list also records the reason each assembly was rejected.
This table summarises the distribution of each metric: sample size (n), mean, standard deviation (SD), minimum, quartiles (Q1, median, Q3), and maximum.
A combined summary table across all species is available on the summary page.
| Metric | Distribution | n | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|---|
| N50 | insufficient_data | 6 | 68,762 | 0 | 68,762 | 68,762 | 68,762 | 68,762 | 68,762 |
| no_of_contigs | insufficient_data | 6 | 86.67 | 0.94 | 86 | 86 | 86 | 87.5 | 88 |
| longest | insufficient_data | 6 | 203,866 | 3.56 | 203,861 | 203,863 | 203,868 | 203,869 | 203,869 |
| GC_Content | insufficient_data | 6 | 56.26 | 0 | 56.26 | 56.26 | 56.26 | 56.26 | 56.26 |
| Completeness_Specific | insufficient_data | 6 | 99.96 | 0.01 | 99.95 | 99.95 | 99.96 | 99.97 | 99.97 |
| Contamination | insufficient_data | 6 | 1.22 | 0.02 | 1.2 | 1.2 | 1.23 | 1.24 | 1.24 |
| Total_Coding_Sequences | insufficient_data | 6 | 3,408 | 4.23 | 3,401 | 3,405 | 3,408 | 3,411 | 3,413 |
| Genome_Size | insufficient_data | 6 | 3,061,036 | 133.44 | 3,060,847 | 3,060,918 | 3,061,127 | 3,061,132 | 3,061,134 |
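As a rough illustration of how one row of the table can be computed, the sketch below uses Python's standard `statistics` module. The six contig counts are hypothetical, chosen only because they are consistent with the no_of_contigs row; the real per-assembly values are in the downloadable files. Using the population standard deviation and linearly interpolated quartiles is also an assumption about the convention, not something this page states.

```python
import statistics

# Hypothetical contig counts, NOT the real data: picked only to be
# consistent with the no_of_contigs row (n = 6, min 86, max 88).
contigs = [86, 86, 86, 86, 88, 88]

mean = statistics.mean(contigs)
# Assumption: population SD (divide by n), not sample SD (divide by n - 1).
sd = statistics.pstdev(contigs)
# Assumption: quartiles by linear interpolation between order statistics.
q1, median, q3 = statistics.quantiles(contigs, n=4, method="inclusive")

summary = {
    "n": len(contigs),
    "mean": round(mean, 2),
    "sd": round(sd, 2),
    "min": min(contigs),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(contigs),
}
print(summary)
```

For this hypothetical sample, these choices give a mean of 86.67, an SD of 0.94, and a Q3 of 87.5, matching the row above; a sample SD (`statistics.stdev`) would give about 1.03 instead.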
Full statistics including KS test vs RefSeq and Wasserstein distance are in the downloadable summary.csv.
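For readers unfamiliar with the two comparison statistics named above, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov statistic and a one-dimensional Wasserstein distance in plain Python. It is not the pipeline's implementation (see the methods page for that); the function names are illustrative, the input values are hypothetical, and the Wasserstein shortcut used here assumes both samples have the same size.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def wasserstein_1d(a, b):
    """1-D Wasserstein distance for equal-size samples:
    mean absolute difference between values paired in sorted order."""
    if len(a) != len(b):
        raise ValueError("this shortcut assumes equal sample sizes")
    pairs = zip(sorted(a), sorted(b))
    return sum(abs(x - y) for x, y in pairs) / len(a)


# Hypothetical contamination values, for illustration only.
reference = [1.20, 1.20, 1.23, 1.24]
candidate = [1.30, 1.35, 1.40, 1.60]

print("KS statistic:", ks_statistic(reference, candidate))
print("Wasserstein distance:", round(wasserstein_1d(reference, candidate), 3))
```

A KS statistic near 0 means the two distributions overlap closely and a value of 1 means they do not overlap at all. The Wasserstein distance is expressed in the metric's own units, so it indicates how far one distribution would need to shift to match the other.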
These thresholds were derived from 12 genomes, including 6 RefSeq references.
Both the Fail and Warn bands are shown as the published rounded values, which are easier to cite and are consistent across the species page, CSV downloads, and downstream QC tools.
| Metric | Fail below | Warn below | Warn above | Fail above |
|---|---|---|---|---|
| Genome_Size | 3,000,000 | 3,000,000 | 3,100,000 | 3,100,000 |
| GC_Content | 56.2 | 56.2 | 56.3 | 56.3 |
| Total_Coding_Sequences | 3,400 | 3,400 | 3,500 | 3,500 |
| Completeness_Specific | 99 | 99 | - | - |
| Contamination | - | - | 2 | 2 |
| N50 | 68,000 | 68,000 | - | - |
| no_of_contigs | - | - | 90 | 90 |
| longest | - | - | - | - |
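How to read this: a value between the two warn columns is typical for this species and passes QC. A value between a warn column and the corresponding fail column is borderline: worth a manual look but not an outright failure. A value outside the fail columns is unusual enough to fail QC. A dash means no threshold is applied on that side. As published, the rounded warn and fail values coincide for every metric in this table, so no value can fall into a borderline band for this species.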
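The reading rules above can be sketched as a small classifier. This is only an illustration under stated assumptions: the threshold values are copied from the table, but the names `Band`, `THRESHOLDS`, and `verdict` are hypothetical, and treating a value exactly equal to a threshold as inside the band is an assumption. The methods page for REFSEQ-QC-v1 remains the authority on the verdict model.

```python
from typing import NamedTuple, Optional


class Band(NamedTuple):
    fail_below: Optional[float]
    warn_below: Optional[float]
    warn_above: Optional[float]
    fail_above: Optional[float]


# Values copied from the threshold table; None stands for "-" (no threshold).
THRESHOLDS = {
    "Genome_Size": Band(3_000_000, 3_000_000, 3_100_000, 3_100_000),
    "GC_Content": Band(56.2, 56.2, 56.3, 56.3),
    "Total_Coding_Sequences": Band(3_400, 3_400, 3_500, 3_500),
    "Completeness_Specific": Band(99, 99, None, None),
    "Contamination": Band(None, None, 2, 2),
    "N50": Band(68_000, 68_000, None, None),
    "no_of_contigs": Band(None, None, 90, 90),
    "longest": Band(None, None, None, None),
}


def verdict(metric: str, value: float) -> str:
    """Return PASS, WARN, or FAIL for one metric value.

    Assumption: comparisons are strict, so a value equal to a
    threshold is treated as still inside the band.
    """
    band = THRESHOLDS[metric]
    if band.fail_below is not None and value < band.fail_below:
        return "FAIL"
    if band.fail_above is not None and value > band.fail_above:
        return "FAIL"
    if band.warn_below is not None and value < band.warn_below:
        return "WARN"
    if band.warn_above is not None and value > band.warn_above:
        return "WARN"
    return "PASS"


print(verdict("Genome_Size", 3_061_036))  # reference mean from the summary table
print(verdict("N50", 50_000))             # hypothetical fragmented assembly
```

With the table as published, the WARN branches are never reached because each warn value equals its fail value; they are kept so the same function works for species where the two bands differ.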
The published rounded thresholds (the values in the table above) were applied to the full AllTheBacteria-2024-08 set for this species. Each row carries the per-metric verdict and, where applicable, the reason a genome was demoted to WARN or FAIL. Files are gzipped CSV.
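A gzipped CSV of this kind can be read directly with Python's standard library, without decompressing it first. The sketch below tallies the values in one column. The column names (`sample`, `verdict`, `reason`) and the rows are hypothetical stand-ins, since the actual file layout is not described on this page; check the header of the downloaded file before relying on them.

```python
import csv
import gzip
import io
from collections import Counter


def count_by_column(gz_source, column):
    """Tally the values of one column in a gzipped CSV.

    gz_source may be a file path or a binary file-like object.
    """
    with gzip.open(gz_source, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[column] for row in reader)


# Hypothetical in-memory file standing in for a downloaded FAIL list.
demo_text = (
    "sample,verdict,reason\n"
    "S1,FAIL,N50\n"
    "S2,FAIL,N50\n"
    "S3,FAIL,no_of_contigs\n"
)
demo_gz = io.BytesIO(gzip.compress(demo_text.encode("utf-8")))

counts = count_by_column(demo_gz, "reason")
print(counts.most_common())
```

For a real download, pass the path of the `.csv.gz` file in place of the in-memory object.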
This plot shows the relationship between the number of coding sequences (CDS) and genome size, that is, how the number of genes scales with assembly length. The relationship should be roughly linear: as genome size increases, the number of coding sequences should rise proportionally. A secondary trend line or non-linear behaviour can indicate either bona fide sub-populations within the retained genomes (e.g. distinct sub-clades) or residual contamination that survived filtering.
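One simple way to put a number on the expected proportionality is coding density, the number of CDS per kilobase. The sketch below derives it from the mean genome size and mean CDS count in the summary table, then compares a hypothetical assembly against that expectation. The function names and the example assembly are illustrative assumptions, and a strictly proportional model is a simplification of whatever trend line the plot uses.

```python
# Mean values taken from the summary table on this page.
MEAN_GENOME_SIZE_BP = 3_061_036
MEAN_CDS = 3_408


def cds_per_kb(cds_count, genome_size_bp):
    """Coding density in CDS per 1,000 bp."""
    return cds_count / genome_size_bp * 1000


def expected_cds(genome_size_bp, density_per_kb):
    """CDS count predicted by a strictly proportional model."""
    return density_per_kb * genome_size_bp / 1000


reference_density = cds_per_kb(MEAN_CDS, MEAN_GENOME_SIZE_BP)
print(f"reference density: {reference_density:.3f} CDS per kb")

# Hypothetical assembly, for illustration only.
assembly_size_bp = 3_200_000
assembly_cds = 3_900

predicted = expected_cds(assembly_size_bp, reference_density)
excess = assembly_cds - predicted
print(f"predicted CDS: {predicted:.0f}, observed excess: {excess:.0f}")
```

An assembly with far more or far fewer CDS than the proportional model predicts would sit off the main trend line in the plot and may be worth inspecting for contamination or fragmentation.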
Histogram comparing SRA to RefSeq; each bar shows genome density across value ranges to highlight shifts, peaks, or outliers.
QQ (quantile-quantile) plot comparing SRA and RefSeq. Points along the diagonal follow the expected distribution; deviations indicate skew, outliers, or other systematic differences.
A table of complete RefSeq genomes for Renibacterium salmoninarum used to calibrate this scheme. The file includes accessions, some sample information, genome size, GC content, and other key metrics.
Per-assembly inputs the engine used to derive the Renibacterium salmoninarum reference distribution for this scheme: sample, sylph species call, N50, contig count, longest contig, total length, completeness, contamination, total coding sequences, genome size, GC content. Gzipped CSV.