BAYSIC

A Bayesian method for combining sets of genome variants with improved specificity and sensitivity

Brandi L. Cantarel, Daniel Weaver, Nathan McNeill, Jianhua Zhang, Aaron J. Mackey, Justin Reese

Research output: Contribution to journal › Article

25 Citations (Scopus)

Abstract

Background: Accurate genomic variant detection is an essential step in gleaning medically useful information from genome data. However, low concordance among variant-calling methods reduces confidence in the clinical validity of whole genome and exome sequence data, and confounds downstream analysis for applications in genome medicine.

Here we describe BAYSIC (BAYeSian Integrated Caller), which combines SNP variant calls produced by different methods (e.g. GATK, FreeBayes, Atlas, SamTools, etc.) into a more accurate set of variant calls. BAYSIC differs from majority voting, consensus or other ad hoc intersection-based schemes for combining sets of genome variant calls. Unlike other classification methods, the underlying BAYSIC model does not require training using a "gold standard" of true positives. Rather, with each new dataset, BAYSIC performs an unsupervised, fully Bayesian latent class analysis to estimate false positive and false negative error rates for each input method. The user specifies a posterior probability threshold according to the user's tolerance for false positive and false negative errors; lowering the posterior probability threshold allows the user to trade specificity for sensitivity, while raising the threshold increases specificity in exchange for sensitivity.

Results: We assessed the performance of BAYSIC in comparison to other variant detection methods using ten low coverage (~5X) samples from The 1000 Genomes Project, a tumor/normal exome pair (40X), and exome sequences (40X) from positive control samples previously identified to contain clinically relevant SNPs. We demonstrated BAYSIC's superior variant-calling accuracy, both for somatic mutation detection and germline variant detection.

Conclusions: BAYSIC provides a method for combining sets of SNP variant calls produced by different variant calling programs. The integrated set of SNP variant calls produced by BAYSIC improves the sensitivity and specificity of the variant calls used as input. In addition to combining sets of germline variants, BAYSIC can also be used to combine sets of somatic mutations detected in the context of tumor/normal sequencing experiments.
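
The idea of estimating per-caller error rates without a gold standard, then thresholding a posterior probability, can be illustrated with a small sketch. This is not the authors' implementation: it substitutes an expectation-maximization fit with add-one smoothing for the fully Bayesian inference the abstract describes, it assumes callers err independently given the true status of a site, and it ignores sites that no caller reports. The function names `fit_latent_class` and `integrate_calls`, the default threshold, and the toy input matrix are all hypothetical.

```python
import numpy as np


def fit_latent_class(calls, n_iter=500, tol=1e-8):
    """Fit a two-class latent class model by EM (illustrative, not BAYSIC itself).

    calls: (n_sites, n_callers) array of 0/1; 1 means the caller reported the site.
    Returns (prevalence, sensitivity, false_pos_rate, posterior), where posterior[i]
    is the estimated probability that site i is a true variant.
    The false negative rate of each caller is 1 - sensitivity.
    """
    calls = np.asarray(calls, dtype=float)
    # Start from the fraction of callers that agree, so the "true" class
    # is the one that callers tend to support.
    posterior = np.clip(calls.mean(axis=1), 0.05, 0.95)
    for _ in range(n_iter):
        # M-step: re-estimate class prevalence and per-caller rates.
        n_true = posterior.sum()
        n_false = (1.0 - posterior).sum()
        prevalence = np.clip(posterior.mean(), 1e-6, 1.0 - 1e-6)
        # Add-one smoothing keeps every rate strictly between 0 and 1.
        sensitivity = (posterior @ calls + 1.0) / (n_true + 2.0)
        false_pos_rate = ((1.0 - posterior) @ calls + 1.0) / (n_false + 2.0)
        # E-step: posterior probability of "true variant" for each site.
        log_true = (np.log(prevalence)
                    + calls @ np.log(sensitivity)
                    + (1.0 - calls) @ np.log(1.0 - sensitivity))
        log_false = (np.log(1.0 - prevalence)
                     + calls @ np.log(false_pos_rate)
                     + (1.0 - calls) @ np.log(1.0 - false_pos_rate))
        log_odds = np.clip(log_true - log_false, -700.0, 700.0)
        updated = 1.0 / (1.0 + np.exp(-log_odds))
        converged = np.max(np.abs(updated - posterior)) < tol
        posterior = updated
        if converged:
            break
    return prevalence, sensitivity, false_pos_rate, posterior


def integrate_calls(posterior, threshold=0.8):
    """Keep sites whose posterior probability meets the user-chosen threshold."""
    return np.asarray(posterior) >= threshold


# Toy input: rows are candidate sites, columns are three hypothetical callers.
toy_calls = np.array([
    [1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0],
    [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1],
])
_, toy_sens, toy_fpr, toy_post = fit_latent_class(toy_calls)
strict = integrate_calls(toy_post, threshold=0.9)   # favors specificity
lenient = integrate_calls(toy_post, threshold=0.5)  # favors sensitivity
```

Raising `threshold` can only shrink the integrated call set, which is the specificity-for-sensitivity trade described in the abstract.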

Original language: English (US)
Article number: 104
Journal: BMC Bioinformatics
Volume: 15
Issue number: 1
DOI: 10.1186/1471-2105-15-104
State: Published - Apr 12 2014

Keywords

  • Bayesian
  • Cancer
  • Genome variants
  • Latent class analysis
  • SNP
  • Somatic mutation

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

BAYSIC: A Bayesian method for combining sets of genome variants with improved specificity and sensitivity. / Cantarel, Brandi L.; Weaver, Daniel; McNeill, Nathan; Zhang, Jianhua; Mackey, Aaron J.; Reese, Justin.

In: BMC Bioinformatics, Vol. 15, No. 1, 104, 12.04.2014.

@article{a5fca6513c6c4ef194080a39959bcc84,
title = "BAYSIC: A Bayesian method for combining sets of genome variants with improved specificity and sensitivity",
keywords = "Bayesian, Cancer, Genome variants, Latent class analysis, SNP, Somatic mutation",
author = "Cantarel, {Brandi L.} and Daniel Weaver and Nathan McNeill and Jianhua Zhang and Mackey, {Aaron J.} and Justin Reese",
year = "2014",
month = "4",
day = "12",
doi = "10.1186/1471-2105-15-104",
language = "English (US)",
volume = "15",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - BAYSIC

T2 - A Bayesian method for combining sets of genome variants with improved specificity and sensitivity

AU - Cantarel, Brandi L.

AU - Weaver, Daniel

AU - McNeill, Nathan

AU - Zhang, Jianhua

AU - Mackey, Aaron J.

AU - Reese, Justin

PY - 2014/4/12

Y1 - 2014/4/12

KW - Bayesian

KW - Cancer

KW - Genome variants

KW - Latent class analysis

KW - SNP

KW - Somatic mutation

UR - http://www.scopus.com/inward/record.url?scp=84899475755&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84899475755&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-15-104

DO - 10.1186/1471-2105-15-104

M3 - Article

VL - 15

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 104

ER -