QC Scheme: qualibact-v1.1 for Haemophilus influenzae

Derived from 11,772 genomes: 299 from RefSeq and 11,473 from other sources. For the derivation pipeline and the PASS / WARN / FAIL verdict model, see the methods page for qualibact-v1.1.

Applied to the full All-The-Bacteria dataset, these thresholds place 10,503 genomes at PASS, 1,422 at WARN, and 575 at FAIL (12,500 assessed in total). The per-tier genome lists can be downloaded below in .csv.gz format; the FAIL list also records the reason each assembly was rejected.
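
As a quick sanity check on the counts above, the share of genomes in each tier can be recomputed directly. The following is a minimal Python sketch; the variable names are illustrative, and only the three counts come from this page.

```python
# Tier counts quoted on this page for Haemophilus influenzae (qualibact-v1.1).
tier_counts = {"PASS": 10_503, "WARN": 1_422, "FAIL": 575}

# Total number of assessed genomes.
total = sum(tier_counts.values())

# Share of assessed genomes in each tier, as a percentage rounded to 1 decimal.
tier_percent = {tier: round(100 * n / total, 1) for tier, n in tier_counts.items()}

print(total)         # 12500
print(tier_percent)  # {'PASS': 84.0, 'WARN': 11.4, 'FAIL': 4.6}
```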

QualiBact qualibact-v1.1 thresholds for Haemophilus influenzae, refined from v1.0 based on the Nov 2025 expert-feedback survey.

Two thresholds are pinned on the recommendation of Margo Diricks (see Acknowledgements):

no_of_contigs upper bound pinned at 100 (engine v1.0-rev2 emits 140). Assemblies with more than 100 contigs are most often associated with contamination.
Genome_Size upper bound pinned at 2.0 Mb (engine v1.0-rev2 emits 2.2 Mb). Genomes larger than 2.0 Mb most often reflect contamination. The engine's auto-derived upper bound was pulled up by two reference assemblies (GCF_002985465.2 and GCF_002984345.2) whose independent ANI checks are inconclusive (~94% fastANI to other H. influenzae genomes), so they are probably not H. influenzae at all.

All other thresholds inherit unchanged from v1.0. WARN tier preserves the engine's tighter values. See the methods page for qualibact-v1.1 for the full pipeline.

Acknowledgements

Threshold values and rationale for Haemophilus influenzae (qualibact-v1.1) contributed by:

Margo Diricks, Forschungszentrum Borstel, Germany

Nov 2025 expert-feedback survey: flagged that assemblies with >100 contigs are most often contaminated, and that the two RefSeq references pushing the Genome_Size upper bound up have inconclusive ANI (~94%) against H. influenzae.

Summary table

This table summarises the distribution of each metric: number of genomes (n), mean, standard deviation, minimum, quartiles (Q1, median, Q3), and maximum.

A combined summary table across all species is available on the summary page.

summary.csv

Metric	Distribution	n	Mean	SD	Min	Q1	Median	Q3	Max
N50	non-normal	11,473	284,724	268,108	16,210	122,077	202,427	322,198	1,158,059
no_of_contigs	non-normal	11,473	43.2	19.61	12	30	40	52	223
longest	non-normal	11,473	476,129	246,470	56,596	287,119	502,311	563,401	1,158,059
GC_Content	non-normal	11,473	37.99	0.1	37.73	37.92	37.99	38.05	38.38
Completeness_Specific	non-normal	11,473	100	0	99.98	100	100	100	100
Contamination	non-normal	11,473	0.16	0.28	0	0	0.04	0.18	3.76
Total_Coding_Sequences	non-normal	11,473	1,783	70.25	1,642	1,729	1,774	1,830	2,108
Genome_Size	non-normal	11,473	1,848,417	50,139	1,743,262	1,810,454	1,842,859	1,881,744	2,104,737

Full statistics including KS test vs RefSeq and Wasserstein distance are in the downloadable summary.csv.
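
For readers unfamiliar with the two comparison statistics named above, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic and the one-dimensional Wasserstein distance. It illustrates the definitions only and is not necessarily how summary.csv was produced; the function names are hypothetical. In practice, `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` offer tested implementations.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The supremum of |F_a - F_b| is reached at one of the observed values.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def wasserstein_1d(a, b):
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    pooled = sorted(a_sorted + b_sorted)
    total = 0.0
    # Both CDFs are constant between consecutive pooled values.
    for left, right in zip(pooled, pooled[1:]):
        fa = bisect_right(a_sorted, left) / len(a_sorted)
        fb = bisect_right(b_sorted, left) / len(b_sorted)
        total += abs(fa - fb) * (right - left)
    return total
```

A KS statistic near 0 means the two samples (for example a metric in RefSeq versus the rest) have nearly identical distributions, while the Wasserstein distance expresses the shift in the metric's own units.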

Suggested thresholds for Haemophilus influenzae (qualibact-v1.1)

Derived from 11,772 genomes including 299 RefSeq references

Both the Fail and Warn bands are shown as the published rounded values, which are easier to cite and stay consistent across the species page, CSV downloads, and downstream QC tools.

Metric	Fail below	Warn below	Warn above	Fail above
Genome_Size	1,700,000	1,700,000	2,000,000	2,000,000
GC_Content	37.7	37.8	38.3	38.5
Total_Coding_Sequences	1,600	1,600	2,000	2,200
Completeness_Specific	93	100	-	-
Contamination	-	-	1	6
N50	30,000	52,000	-	-
no_of_contigs	-	-	100	100
longest	-	-	-	-

How to read this: a value between the two warn columns is typical for this species and passes QC. A value between a warn column and the corresponding fail column is borderline — worth a manual look but not an outright failure. A value outside the fail columns is unusual enough to fail QC.
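
The reading rule above can be sketched in Python as follows. The threshold values are copied from the table; everything else is an assumption made for illustration: the names `Band`, `metric_verdict` and `genome_verdict` are hypothetical, boundaries are treated as strict (a value exactly on a threshold is not demoted), and the overall verdict is taken as the worst per-metric verdict. The authoritative verdict model is described on the methods page.

```python
from typing import NamedTuple, Optional


class Band(NamedTuple):
    fail_below: Optional[float]
    warn_below: Optional[float]
    warn_above: Optional[float]
    fail_above: Optional[float]


# Published rounded thresholds from the table above; None stands for "-".
THRESHOLDS = {
    "Genome_Size": Band(1_700_000, 1_700_000, 2_000_000, 2_000_000),
    "GC_Content": Band(37.7, 37.8, 38.3, 38.5),
    "Total_Coding_Sequences": Band(1_600, 1_600, 2_000, 2_200),
    "Completeness_Specific": Band(93, 100, None, None),
    "Contamination": Band(None, None, 1, 6),
    "N50": Band(30_000, 52_000, None, None),
    "no_of_contigs": Band(None, None, 100, 100),
    "longest": Band(None, None, None, None),
}


def metric_verdict(metric, value):
    """Return PASS, WARN or FAIL for one metric (strict bounds assumed)."""
    band = THRESHOLDS[metric]
    if band.fail_below is not None and value < band.fail_below:
        return "FAIL"
    if band.fail_above is not None and value > band.fail_above:
        return "FAIL"
    if band.warn_below is not None and value < band.warn_below:
        return "WARN"
    if band.warn_above is not None and value > band.warn_above:
        return "WARN"
    return "PASS"


def genome_verdict(metrics):
    """Assumption: the worst per-metric verdict decides the genome's tier."""
    severity = {"PASS": 0, "WARN": 1, "FAIL": 2}
    verdicts = {m: metric_verdict(m, v) for m, v in metrics.items() if m in THRESHOLDS}
    overall = max(verdicts.values(), key=severity.__getitem__, default="PASS")
    return overall, verdicts


# Illustrative genome built from the medians in the summary table.
example = {
    "Genome_Size": 1_842_859, "GC_Content": 37.99, "Total_Coding_Sequences": 1_774,
    "Completeness_Specific": 100, "Contamination": 0.04, "N50": 202_427,
    "no_of_contigs": 40, "longest": 502_311,
}
overall, per_metric = genome_verdict(example)
print(overall)  # PASS
```

Where the warn and fail values coincide (Genome_Size and no_of_contigs), there is no borderline band under these assumptions: a value is either PASS or FAIL for that metric.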

Download CSV

All-The-Bacteria — PASS / WARN / FAIL genome lists

The published rounded thresholds (the values in the table above) were applied to the full AllTheBacteria-2024-08 set for this species. Each row carries the per-metric verdict and, where applicable, the reason a genome was demoted to WARN or FAIL. Files are gzipped CSV.

PASS genomes WARN genomes FAIL genomes

CDS vs Genome Size

This plot shows the relationship between the number of coding sequences (CDS) and genome size — how the number of genes scales with assembly length. The relationship should be roughly linear: as genome size increases, the number of coding sequences should rise proportionally. A secondary trend line or non-linear behaviour can indicate either bona fide sub-populations within the retained genomes (e.g. distinct sub-clades) or residual contamination that survived filtering.
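
One way to look for the off-trend genomes described above is to fit a straight line to CDS count against genome size and flag assemblies with large relative residuals. The sketch below is an illustration under stated assumptions, not the method used to draw the plot: the function names and the 5% default tolerance are hypothetical. As a rough orientation, the medians in the summary table (1,774 CDS and 1,842,859 bp) correspond to about 0.96 CDS per kb.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_off_trend(genome_sizes, cds_counts, max_rel_residual=0.05):
    """Indices of genomes whose CDS count deviates from the fitted line
    by more than max_rel_residual (a hypothetical tolerance)."""
    slope, intercept = fit_line(genome_sizes, cds_counts)
    flagged = []
    for i, (size, cds) in enumerate(zip(genome_sizes, cds_counts)):
        expected = slope * size + intercept
        if abs(cds - expected) > max_rel_residual * expected:
            flagged.append(i)
    return flagged
```

Because an extreme outlier also pulls the fitted line towards itself, a robust fit or a second pass after removing flagged genomes may be preferable on real data.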

RefSeq distributions

GC Content (RefSeq)

Histogram (SRA vs RefSeq)

Histogram comparing SRA to RefSeq; each bar shows genome density across value ranges to highlight shifts, peaks, or outliers.

QQ plot (SRA vs RefSeq)

QQ (quantile-quantile) plot comparing SRA and RefSeq. Points along the diagonal follow the expected distribution; deviations indicate skew, outliers, or other systematic differences.
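
The points of such a QQ plot can be computed by pairing matching quantiles of the two samples. This is a minimal sketch using Python's standard library; `qq_points` is a hypothetical helper, and the choice of 100 quantile cuts and of the "inclusive" method is an assumption, not a statement about how the plots on this page were generated.

```python
from statistics import quantiles


def qq_points(sample_a, sample_b, n=100):
    """Pair the n-1 cut points of each sample: (quantile of a, quantile of b)."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))
```

If both samples follow the same distribution, the pairs lie close to the diagonal; a constant offset between the two coordinates indicates a shift, and bending at the ends indicates differences in the tails.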

Table of included RefSeq complete genomes

A table of complete RefSeq genomes for Haemophilus influenzae used to calibrate this scheme. The file includes accessions, some sample information, genome size, GC content, and other key metrics.

Download table

Filtered plots

Longest contig vs Completeness

These plots show genomes before and after filtering to highlight the outliers removed:

Left: heatmap of all genomes in the dataset.
Middle: a representative sample of genomes, with anomalies highlighted in purple.
Right: the filtered distribution after applying the thresholds.

The filtered distribution shown here may not exactly match the published thresholds because additional rounding and curator adjustments are applied on top.

All

Haemophilus_influenzae_all_longest_Completeness_Specific.png

Sample

Haemophilus_influenzae_sample_longest_Completeness_Specific.png

Filtered

Haemophilus_influenzae_filt_longest_Completeness_Specific.png

All QC schemes for this species

qualibact-v1.0 qualibact-v1.1 (current)
