Predictive big data analytics: A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations

Ivo D. Dinov, Ben Heavner, Ming Tang, Gustavo Glusman, Kyle Chard, Mike Darcy, Ravi Madduri, Judy Pa, Cathie Spino, Carl Kesselman, Ian Foster, Eric W. Deutsch, Nathan D. Price, John D. Van Horn, Joseph Ames, Kristi Clark, Leroy Hood, Benjamin M. Hampstead, William Dauer, Arthur W. Toga

Research output: Contribution to journalArticle

32 Citations (Scopus)

Abstract

Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data - large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources-all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.

Original languageEnglish (US)
Article numbere0157077
JournalPloS one
Volume11
Issue number8
DOIs
StatePublished - Aug 2016
Externally publishedYes

Fingerprint

Parkinson disease
Parkinson Disease
artificial intelligence
rating scales
Learning systems
taxonomy
methodology
prediction
demographic statistics
Demography
Neuroimaging
Big data
Adaptive boosting
Information Storage and Retrieval
Amyotrophic Lateral Sclerosis
neurodegenerative diseases
Biomarkers
cerebellum
information sources
Processing

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Agricultural and Biological Sciences(all)

Cite this

Predictive big data analytics : A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations. / Dinov, Ivo D.; Heavner, Ben; Tang, Ming; Glusman, Gustavo; Chard, Kyle; Darcy, Mike; Madduri, Ravi; Pa, Judy; Spino, Cathie; Kesselman, Carl; Foster, Ian; Deutsch, Eric W.; Price, Nathan D.; Van Horn, John D.; Ames, Joseph; Clark, Kristi; Hood, Leroy; Hampstead, Benjamin M.; Dauer, William; Toga, Arthur W.

In: PloS one, Vol. 11, No. 8, e0157077, 08.2016.

Research output: Contribution to journalArticle

Dinov, ID, Heavner, B, Tang, M, Glusman, G, Chard, K, Darcy, M, Madduri, R, Pa, J, Spino, C, Kesselman, C, Foster, I, Deutsch, EW, Price, ND, Van Horn, JD, Ames, J, Clark, K, Hood, L, Hampstead, BM, Dauer, W & Toga, AW 2016, 'Predictive big data analytics: A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations', PloS one, vol. 11, no. 8, e0157077. https://doi.org/10.1371/journal.pone.0157077
Dinov, Ivo D. ; Heavner, Ben ; Tang, Ming ; Glusman, Gustavo ; Chard, Kyle ; Darcy, Mike ; Madduri, Ravi ; Pa, Judy ; Spino, Cathie ; Kesselman, Carl ; Foster, Ian ; Deutsch, Eric W. ; Price, Nathan D. ; Van Horn, John D. ; Ames, Joseph ; Clark, Kristi ; Hood, Leroy ; Hampstead, Benjamin M. ; Dauer, William ; Toga, Arthur W. / Predictive big data analytics : A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations. In: PloS one. 2016 ; Vol. 11, No. 8.
@article{70a9762b9ae84262afba8a1174924c26,
title = "Predictive big data analytics: A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations",
abstract = "Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data - large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources-all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96{\%}, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80{\%}. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.",
author = "Dinov, {Ivo D.} and Ben Heavner and Ming Tang and Gustavo Glusman and Kyle Chard and Mike Darcy and Ravi Madduri and Judy Pa and Cathie Spino and Carl Kesselman and Ian Foster and Deutsch, {Eric W.} and Price, {Nathan D.} and {Van Horn}, {John D.} and Joseph Ames and Kristi Clark and Leroy Hood and Hampstead, {Benjamin M.} and William Dauer and Toga, {Arthur W.}",
year = "2016",
month = "8",
doi = "10.1371/journal.pone.0157077",
language = "English (US)",
volume = "11",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "8",

}

TY - JOUR

T1 - Predictive big data analytics

T2 - A study of Parkinson's disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations

AU - Dinov, Ivo D.

AU - Heavner, Ben

AU - Tang, Ming

AU - Glusman, Gustavo

AU - Chard, Kyle

AU - Darcy, Mike

AU - Madduri, Ravi

AU - Pa, Judy

AU - Spino, Cathie

AU - Kesselman, Carl

AU - Foster, Ian

AU - Deutsch, Eric W.

AU - Price, Nathan D.

AU - Van Horn, John D.

AU - Ames, Joseph

AU - Clark, Kristi

AU - Hood, Leroy

AU - Hampstead, Benjamin M.

AU - Dauer, William

AU - Toga, Arthur W.

PY - 2016/8

Y1 - 2016/8

N2 - Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data - large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources-all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.

AB - Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or life style. The defining characteristics of Big Data - large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources-all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data. Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting. Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.

UR - http://www.scopus.com/inward/record.url?scp=84983464569&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84983464569&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0157077

DO - 10.1371/journal.pone.0157077

M3 - Article

C2 - 27494614

AN - SCOPUS:84983464569

VL - 11

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 8

M1 - e0157077

ER -