Estimating sequencing error rates, without alignment

Given a set of sequenced reads, how can we tell whether the sequencing run was good? Assessing read quality, which includes estimating sequencing error rates and bias, is an important first step in many bioinformatics pipelines.

Existing ways of estimating sequencing error rates include mapping the reads to reference genomes and inferring error rates from Phred quality scores. Unfortunately, reference genomes may be missing or may differ from the genomes that were actually sequenced, especially in metagenomic samples. Phred quality scores, in turn, can produce biased estimates if they are uncalibrated.
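To make the Phred-based approach concrete: a Phred score Q encodes an error probability p through Q = -10 log10(p), so Q20 corresponds to a 1% chance that the base call is wrong. The following is a minimal sketch of turning a FASTQ quality string into a mean error rate, assuming the common Phred+33 ASCII encoding. The function names are ours, chosen for illustration. The result is only as trustworthy as the scores themselves, which is exactly the calibration problem noted above.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_rate(quality_string, offset=33):
    """Mean per-base error probability implied by a FASTQ quality string.

    Assumes Phred+33 encoding (offset=33); older Illumina files may use 64.
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in quality_string]
    return sum(probs) / len(probs)


# '5' encodes Q20 and 'I' encodes Q40 under Phred+33.
rate = mean_error_rate("5I")
```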

We therefore propose skiver, a new framework for estimating sequencing error and bias that requires neither a reference genome nor Phred scores.

Workflow of skiver.

The key idea of skiver is to use (k, v)-mer sketches to represent the large number of sequencing reads. A (k, v)-mer is a segment of length k+v, where the first k bases are the key and the last v bases are the value. By grouping together the (k, v)-mers that share the same key, we can identify the consensus value for each key and estimate the frequency of sequencing errors from the values that disagree with it.
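The grouping step can be illustrated with a small sketch. This is not skiver's implementation: it skips the sketching (subsampling) step, ignores reverse complements, and uses a deliberately naive statistic, namely the fraction of values that disagree with the most common value for their key. That fraction is a per-value mismatch rate, not a per-base error rate, and it does not separate substitutions from insertions and deletions. All function names and the min_count parameter are hypothetical.

```python
from collections import Counter, defaultdict


def kv_mers(read, k, v):
    """Yield every (key, value) pair: k bases followed by the next v bases."""
    for i in range(len(read) - k - v + 1):
        yield read[i:i + k], read[i + k:i + k + v]


def group_by_key(reads, k, v):
    """Count how often each value follows each key across all reads."""
    groups = defaultdict(Counter)
    for read in reads:
        for key, value in kv_mers(read, k, v):
            groups[key][value] += 1
    return groups


def nonconsensus_fraction(groups, min_count=2):
    """Fraction of values that differ from their key's most common value.

    Keys seen fewer than min_count times are skipped, because a consensus
    cannot be meaningfully defined for them.
    """
    total = 0
    mismatched = 0
    for counts in groups.values():
        n = sum(counts.values())
        if n < min_count:
            continue
        consensus_count = counts.most_common(1)[0][1]
        total += n
        mismatched += n - consensus_count
    return mismatched / total if total else 0.0


# Three toy reads; the last one differs from the others in its final base.
reads = ["ACGTAC", "ACGTAC", "ACGTAG"]
groups = group_by_key(reads, k=3, v=2)
fraction = nonconsensus_fraction(groups)
```

In this toy input, the key ACG is always followed by TA, while the key CGT is followed by AC twice and by AG once, so one of the six observed values disagrees with its consensus.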

Experiments on various real datasets show that skiver is able to accurately estimate the sequencing error rate and infer the percentage of k-mers in the read set that are free of sequencing errors. In addition, skiver can estimate the substitution, insertion, and deletion rates, revealing the bias of various sequencing platforms.

Skiver’s estimation of error rates and error spectra on various metagenomic samples.

Finally, skiver is computationally lightweight, making it a handy tool for quality control in modern bioinformatics pipelines.

Computational resources needed by skiver and other baselines.
