TY - JOUR
T1 - A two-stage approach of gene network analysis for high-dimensional heterogeneous data
AU - Lee, Sangin
AU - Liang, Faming
AU - Cai, Ling
AU - Xiao, Guanghua
N1 - Funding Information:
National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1C1B2010113 to SL), the National Science Foundation grant (DMS-1612924 to FL), American Association for Cancer Research (AACR) Basic Cancer Research Fellowship (Award Number: 15-40-01-CAIL to LC) and the National Institutes of Health grants (1R01CA17221 to GX, 1R01GM117597 to FL).
Publisher Copyright:
© The Author 2017. Published by Oxford University Press. All rights reserved.
PY - 2018/4/1
Y1 - 2018/4/1
N2 - Gaussian graphical models have been widely used to construct gene regulatory networks from gene expression data. Most existing methods for Gaussian graphical models are designed to model homogeneous data, assuming a single Gaussian distribution. In practice, however, data may consist of gene expression studies with unknown confounding factors, such as study cohort, microarray platforms, experimental batches, which produce heterogeneous data, and hence lead to false positive edges or low detection power in resulting network, due to those unknown factors. To overcome this problem and improve the performance in constructing gene networks, we propose a two-stage approach to construct a gene network from heterogeneous data. The first stage is to perform a clustering analysis in order to assign samples to a few clusters where the samples in each cluster are approximately homogeneous, and the second stage is to conduct an integrative analysis of networks from each cluster. In particular, we first apply a model-based clustering method using the singular value decomposition for high-dimensional data, and then integrate the networks from each cluster using the integrative ψ-learning method. The proposed method is based on an equivalent measure of partial correlation coefficients in Gaussian graphical models, which is computed with a reduced conditional set and thus it is useful for high-dimensional data. We compare the proposed two-stage learning approach with some existing methods in various simulation settings, and demonstrate the robustness of the proposed method. Finally, it is applied to integrate multiple gene expression studies of lung adenocarcinoma to identify potential therapeutic targets and treatment biomarkers.
AB - Gaussian graphical models have been widely used to construct gene regulatory networks from gene expression data. Most existing methods for Gaussian graphical models are designed to model homogeneous data, assuming a single Gaussian distribution. In practice, however, data may consist of gene expression studies with unknown confounding factors, such as study cohort, microarray platforms, experimental batches, which produce heterogeneous data, and hence lead to false positive edges or low detection power in resulting network, due to those unknown factors. To overcome this problem and improve the performance in constructing gene networks, we propose a two-stage approach to construct a gene network from heterogeneous data. The first stage is to perform a clustering analysis in order to assign samples to a few clusters where the samples in each cluster are approximately homogeneous, and the second stage is to conduct an integrative analysis of networks from each cluster. In particular, we first apply a model-based clustering method using the singular value decomposition for high-dimensional data, and then integrate the networks from each cluster using the integrative ψ-learning method. The proposed method is based on an equivalent measure of partial correlation coefficients in Gaussian graphical models, which is computed with a reduced conditional set and thus it is useful for high-dimensional data. We compare the proposed two-stage learning approach with some existing methods in various simulation settings, and demonstrate the robustness of the proposed method. Finally, it is applied to integrate multiple gene expression studies of lung adenocarcinoma to identify potential therapeutic targets and treatment biomarkers.
KW - Gaussian graphical model
KW - Gene regulatory network
KW - Integrative analysis
KW - Model-based clustering
KW - Partial correlation coefficient
KW - Two-stage approach
UR - http://www.scopus.com/inward/record.url?scp=85044768648&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85044768648&partnerID=8YFLogxK
U2 - 10.1093/biostatistics/kxx033
DO - 10.1093/biostatistics/kxx033
M3 - Article
C2 - 29036516
AN - SCOPUS:85044768648
SN - 1465-4644
VL - 19
SP - 216
EP - 232
JO - Biostatistics
JF - Biostatistics
IS - 2
ER -