Genotype imputation reference panel selection using maximal phylogenetic diversity

Peng Zhang, Xiaowei Zhan, Noah A. Rosenberg, Sebastian Zöllner

Research output: Contribution to journal (Article)

9 Citations (Scopus)

Abstract

The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the "most diverse reference panel", defined as the subset with the maximal "phylogenetic diversity", thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.
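The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the pairwise genetic distances between individuals are already available as a symmetric matrix that behaves like path lengths on a tree, and it takes the phylogenetic diversity of a subset to be the total branch length of the subtree connecting that subset. The function name `greedy_pd_panel`, the example matrix `EXAMPLE_DIST`, and the greedy strategy itself are assumptions made for illustration.

```python
from itertools import combinations


def greedy_pd_panel(dist, k):
    """Greedily pick k individuals with large phylogenetic diversity (PD).

    dist is a symmetric matrix of pairwise distances, assumed here to be
    additive (path lengths on a tree). Returns (selected indices, PD).
    Hypothetical helper for illustration only.
    """
    n = len(dist)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of individuals")

    # Seed with the most distant pair; its PD is the path length between them.
    a, b = max(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])
    selected = [a, b]
    pd = float(dist[a][b])

    while len(selected) < k:
        best, best_gain = None, float("-inf")
        for x in range(n):
            if x in selected:
                continue
            # On a tree, (d(x,i) + d(x,j) - d(i,j)) / 2 is the distance from
            # x to the path i-j; the minimum over all selected pairs is the
            # branch length that x would add to the current subtree.
            gain = min(
                (dist[x][i] + dist[x][j] - dist[i][j]) / 2
                for i, j in combinations(selected, 2)
            )
            if gain > best_gain:
                best, best_gain = x, gain
        selected.append(best)
        pd += best_gain
    return selected, pd


# Toy tree ((A:1, B:2):2, (C:3, D:5)); indices 0..3 correspond to A..D.
EXAMPLE_DIST = [
    [0, 3, 6, 8],
    [3, 0, 7, 9],
    [6, 7, 0, 8],
    [8, 9, 8, 0],
]

if __name__ == "__main__":
    print(greedy_pd_panel(EXAMPLE_DIST, 3))
```

On the toy matrix, a panel of size 3 consists of individuals B, D, and C (indices 1, 3, 2) with a diversity of 12, out of a total tree length of 13. With real genotype data the distances would generally not be exactly tree-like, so the gain formula would then act only as a heuristic.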

Original language: English (US)
Pages (from-to): 319-330
Number of pages: 12
Journal: Genetics
Volume: 195
Issue number: 2
DOI: 10.1534/genetics.113.154591
State: Published - Oct 2013

Fingerprint

Genotype
Population
Costs and Cost Analysis
Genome-Wide Association Study
Human Genome
Genomics
Research Personnel
Genome
Technology

ASJC Scopus subject areas

  • Genetics

Cite this

Genotype imputation reference panel selection using maximal phylogenetic diversity. / Zhang, Peng; Zhan, Xiaowei; Rosenberg, Noah A.; Zöllner, Sebastian. In: Genetics, Vol. 195, No. 2, 10.2013, p. 319-330.

Zhang, Peng ; Zhan, Xiaowei ; Rosenberg, Noah A. ; Zöllner, Sebastian. / Genotype imputation reference panel selection using maximal phylogenetic diversity. In: Genetics. 2013 ; Vol. 195, No. 2. pp. 319-330.
@article{604cc95a27e24f879ac53acf55db36fb,
title = "Genotype imputation reference panel selection using maximal phylogenetic diversity",
abstract = "The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the {"}most diverse reference panel{"}, defined as the subset with the maximal {"}phylogenetic diversity{"}, thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.",
author = "Peng Zhang and Xiaowei Zhan and Rosenberg, {Noah A.} and Sebastian Z{\"o}llner",
year = "2013",
month = "10",
doi = "10.1534/genetics.113.154591",
language = "English (US)",
volume = "195",
pages = "319--330",
journal = "Genetics",
issn = "0016-6731",
publisher = "Genetics Society of America",
number = "2",

}

TY - JOUR

T1 - Genotype imputation reference panel selection using maximal phylogenetic diversity

AU - Zhang, Peng

AU - Zhan, Xiaowei

AU - Rosenberg, Noah A.

AU - Zöllner, Sebastian

PY - 2013/10

Y1 - 2013/10

AB - The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the "most diverse reference panel", defined as the subset with the maximal "phylogenetic diversity", thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.

UR - http://www.scopus.com/inward/record.url?scp=84884921432&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84884921432&partnerID=8YFLogxK

U2 - 10.1534/genetics.113.154591

DO - 10.1534/genetics.113.154591

M3 - Article

C2 - 23934887

AN - SCOPUS:84884921432

VL - 195

SP - 319

EP - 330

JO - Genetics

JF - Genetics

SN - 0016-6731

IS - 2

ER -