TY - JOUR
T1 - SeqWho
T2 - Reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers
AU - Bennett, Christopher
AU - Thornton, Micah
AU - Park, Chanhee
AU - Henry, Gervaise
AU - Zhang, Yun
AU - Malladi, Venkat
AU - Kim, Daehwan
N1 - Publisher Copyright: © 2022 The Author(s). Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
PY - 2022/4/1
Y1 - 2022/4/1
AB - Motivation: With vast improvements in sequencing technologies and a growing number of protocols, sequencing is being used to answer complex biological problems. Consequently, analysis pipelines have become more time-consuming and complicated, usually requiring extensive prevalidation steps. Here, we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases inherent in k-mer frequencies and repeat sequence identities. Results: Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library and both together, 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.
UR - http://www.scopus.com/inward/record.url?scp=85128388562&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128388562&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btac050
DO - 10.1093/bioinformatics/btac050
M3 - Article
C2 - 35134110
AN - SCOPUS:85128388562
SN - 1367-4803
VL - 38
SP - 1830
EP - 1837
JO - Bioinformatics
JF - Bioinformatics
IS - 7
ER -