A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data

Wei Pan, Kyeong S. Jeong, Yang Xie, Arkady Khodursky

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

With the rapid accumulation of various high-throughput genomic and proteomic data, one is compelled to develop new statistical methods that can take advantage of multiple existing sources of data. In our motivating example, a chromatin-immunoprecipitation (ChIP) microarray experiment was conducted to detect binding target genes of a broad transcription regulator, leucine-responsive regulatory protein (Lrp), in E. coli. In addition, a cDNA microarray dataset is available to compare gene expression of the wild type with that of a mutant with the Lrp gene deleted in E. coli. It is biologically reasonable to assume that genes with altered expression are more likely to be regulated by Lrp than those with no expression change. Hence we aim to borrow information from the gene expression data to increase statistical power to detect the binding targets of Lrp. We propose a novel joint model for protein-DNA binding data and gene expression data; under mild modeling assumptions, it is shown that the method is optimal, equivalent to a joint likelihood ratio test. We compare the joint modeling with two existing methods of combining separate analyses. We adopt a nonparametric empirical Bayes (EB) method to draw statistical inference in the joint model; in particular, we propose a new method, maximum likelihood conditional on the binding data, to estimate two prior probabilities for the expression data, which are non-identifiable based on the expression data alone. We use simulated data to demonstrate the improved performance of the joint modeling over other approaches. Application to the Lrp data also shows better performance of the joint modeling than that of analyzing the binding data alone.

Original language: English (US)
Pages (from-to): 709-729
Number of pages: 21
Journal: Statistica Sinica
Volume: 18
Issue number: 2
State: Published - Apr 2008

Fingerprint

Nonparametric Bayes
Joint Modeling
Empirical Bayes
Genomics
Protein
Joint Model
Gene Expression Data
Gene
Escherichia Coli
Empirical Bayes Method
cDNA Microarray
Conditional Maximum Likelihood
Modeling
DNA-binding Protein
Statistical Power
Prior Probability
Target
Chromatin
Maximum Likelihood Method
Proteomics

Keywords

  • ChIP-chip
  • Computational biology
  • False discovery rate
  • Gene expression
  • Lrp
  • Microarray

ASJC Scopus subject areas

  • Mathematics(all)
  • Statistics and Probability

Cite this

A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data. / Pan, Wei; Jeong, Kyeong S.; Xie, Yang; Khodursky, Arkady.

In: Statistica Sinica, Vol. 18, No. 2, 04.2008, p. 709-729.

Pan, Wei ; Jeong, Kyeong S. ; Xie, Yang ; Khodursky, Arkady. / A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data. In: Statistica Sinica. 2008 ; Vol. 18, No. 2. pp. 709-729.
@article{d3b6a18cb0fb49ada4c790b2b500c6c0,
title = "A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data",
keywords = "ChIP-chip, Computational biology, False discovery rate, Gene expression, Lrp, Microarray",
author = "Wei Pan and Jeong, {Kyeong S.} and Yang Xie and Arkady Khodursky",
year = "2008",
month = "4",
language = "English (US)",
volume = "18",
pages = "709--729",
journal = "Statistica Sinica",
issn = "1017-0405",
publisher = "Institute of Statistical Science",
number = "2",

}

TY - JOUR

T1 - A nonparametric empirical Bayes approach to joint modeling of multiple sources of genomic data

AU - Pan, Wei

AU - Jeong, Kyeong S.

AU - Xie, Yang

AU - Khodursky, Arkady

PY - 2008/4

Y1 - 2008/4

KW - ChIP-chip

KW - Computational biology

KW - False discovery rate

KW - Gene expression

KW - Lrp

KW - Microarray

UR - http://www.scopus.com/inward/record.url?scp=47849132500&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=47849132500&partnerID=8YFLogxK

M3 - Article

VL - 18

SP - 709

EP - 729

JO - Statistica Sinica

JF - Statistica Sinica

SN - 1017-0405

IS - 2

ER -