Representative splitting cross validation

Lu Xu; Ou Hu; Yuwan Guo; Mengqin Zhang; Daowang Lu; Chen Bo Cai; Shunping Xie; Mohammad Goodarzi; Hai Yan Fu; Yuan Bin She

doi:10.1016/j.chemolab.2018.10.008

Research output: Contribution to journal › Article › peer-review

18 Scopus citations

Abstract

Cross-validation (CV) is widely used to estimate model complexity or the number of significant latent variables (LVs) for multivariate calibration methods like partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed in the experimental space as uniformly as possible. Motivated by this idea, we proposed a new CV method called representative splitting cross-validation (RSCV). In RSCV, firstly, the DUPLEX algorithm was used to sequentially divide the original training set into k (in this work, k = 2, 4, 8 and 16) equal parts. Secondly, a series of k-fold (k = 2, 4, 8 and 16) CVs were performed based on the above data splitting. Finally, the pooled root mean squared error of CV (RMSECV) was used to estimate model complexity. Five real multivariate calibration data sets were investigated and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV was shown to be a useful and stable method to select PLS LVs, and can obtain simpler models with acceptable computational burden.
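The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`duplex_halves`, `representative_parts`, `pooled_rmsecv`, `mean_model`) are hypothetical. The DUPLEX-style split seeds each half with the two mutually farthest remaining samples and then alternately adds the sample farthest from its nearest member in that half, which is one common reading of DUPLEX. "Sequentially divide" is assumed to mean repeated halving (2, 4, 8, ... parts), and "pooled" is assumed to mean that squared errors from all k-fold runs are accumulated before taking the root mean. The model is passed in as a callback; a real use would supply a PLS fit with `n_components` latent variables.

```python
import math
from itertools import combinations


def duplex_halves(X, idx):
    """Split sample indices `idx` into two halves, DUPLEX-style (assumed variant)."""
    if len(idx) < 4:
        raise ValueError("need at least 4 samples to seed both halves")
    remaining = list(idx)
    halves = ([], [])
    # Seed each half with the two mutually farthest remaining samples.
    for half in halves:
        a, b = max(combinations(remaining, 2),
                   key=lambda pair: math.dist(X[pair[0]], X[pair[1]]))
        half.extend((a, b))
        remaining.remove(a)
        remaining.remove(b)
    # The smaller half (first one on ties) takes the remaining sample
    # that is farthest from its nearest member in that half.
    while remaining:
        half = min(halves, key=len)
        pick = max(remaining,
                   key=lambda i: min(math.dist(X[i], X[j]) for j in half))
        half.append(pick)
        remaining.remove(pick)
    return halves


def representative_parts(X, levels):
    """Return {k: list of k index lists} for k = 2, 4, ..., 2**levels."""
    parts = [list(range(len(X)))]
    splits = {}
    for _ in range(levels):
        # Halve every current part, so each level doubles the number of parts.
        parts = [h for p in parts for h in duplex_halves(X, p)]
        splits[len(parts)] = parts
    return splits


def pooled_rmsecv(X, y, splits, fit_predict, n_components):
    """Pool squared CV errors over every k-fold run in `splits` (assumed pooling)."""
    sse, count = 0.0, 0
    for parts in splits.values():
        for fold in parts:
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in fold], n_components)
            sse += sum((p - y[i]) ** 2 for p, i in zip(preds, fold))
            count += len(fold)
    return math.sqrt(sse / count)


def mean_model(train_X, train_y, test_X, n_components):
    # Stand-in only: ignores n_components and predicts the training mean.
    # Replace with a PLS fit/predict to select latent variables.
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_X)


# Toy usage: 32 samples on an 8 x 4 grid with a constant response.
X = [(float(i % 8), float(i // 8)) for i in range(32)]
y = [5.0] * 32
splits = representative_parts(X, levels=3)  # k = 2, 4, 8
score = pooled_rmsecv(X, y, splits, mean_model, n_components=1)
```

To select model complexity in this sketch, one would evaluate `pooled_rmsecv` for each candidate `n_components` and compare the resulting values; with `levels=4` the splits would cover k = 2, 4, 8 and 16, as in the abstract, provided every part being halved still holds at least four samples.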

Original language	English (US)
Pages (from-to)	29-35
Number of pages	7
Journal	Chemometrics and Intelligent Laboratory Systems
Volume	183
DOIs	https://doi.org/10.1016/j.chemolab.2018.10.008
State	Published - Dec 15 2018

Keywords

Cross-validation (CV)
Model complexity
Multivariate calibration
Partial least squares (PLS)
Representative splitting cross-validation (RSCV)

ASJC Scopus subject areas

Software
Analytical Chemistry
Process Chemistry and Technology
Spectroscopy
Computer Science Applications

Cite this

@article{eba3124b603743da87822f7a323e94cd,

title = "Representative splitting cross validation",

abstract = "Cross-validation (CV) is widely used to estimate model complexity or the number of significant latent variables (LVs) for multivariate calibration methods like partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed in the experimental space as uniformly as possible. Motivated by this idea, we proposed a new CV method called representative splitting cross-validation (RSCV). In RSCV, firstly, the DUPLEX algorithm was used to sequentially divide the original training set into k (in this work, k = 2, 4, 8 and 16) equal parts. Secondly, a series of k-fold (k = 2, 4, 8 and 16) CVs were performed based on the above data splitting. Finally, the pooled root mean squared error of CV (RMSECV) was used to estimate model complexity. Five real multivariate calibration data sets were investigated and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV was shown to be a useful and stable method to select PLS LVs, and can obtain simpler models with acceptable computational burden.",

keywords = "Cross-validation (CV), Model complexity, Multivariate calibration, Partial least squares (PLS), Representative splitting cross-validation (RSCV)",

author = "Lu Xu and Ou Hu and Yuwan Guo and Mengqin Zhang and Daowang Lu and Cai, {Chen Bo} and Shunping Xie and Mohammad Goodarzi and Fu, {Hai Yan} and She, {Yuan Bin}",

note = "Funding Information: Authors are grateful to the financial support from the National Natural Science Foundation of China (Grants Nos. 21665022, 21665002, 21776321, 21576297, 21706233, 21476270), Key Projects of Technological Innovation of Hubei Province (2016ACA138), the Open Research Program (Nos. 2015ZD001, 2015ZD002 and 2015ZY006) from the Modernization Engineering Technology Research Center of Ethnic Minority Medicine of Hubei province (South-Central University for Nationalities), and The Talented Youth Cultivation Program from “the Fundamental Research Funds for the Central Universities”, South-Central University for Nationalities (No. CRZ18002). Lu Xu is financially supported by Provincial Key Disciplines of Chemical Engineering and Technology in Guizhou Province (No. ZDXK[2017]8), Guizhou Engineering Research Center (QJHKYZ [2017]024), Guizhou Provincial Science and Technology Department (No. QKHJC[2017]1186), and the Talented Researcher Program from Guizhou Provincial Department of Education (QJHKYZ[2018]073). Publisher Copyright: {\textcopyright} 2018 Elsevier B.V.",

year = "2018",

month = dec,

day = "15",

doi = "10.1016/j.chemolab.2018.10.008",

language = "English (US)",

volume = "183",

pages = "29--35",

journal = "Chemometrics and Intelligent Laboratory Systems",

issn = "0169-7439",

publisher = "Elsevier",

}

TY - JOUR

T1 - Representative splitting cross validation

AU - Xu, Lu

AU - Hu, Ou

AU - Guo, Yuwan

AU - Zhang, Mengqin

AU - Lu, Daowang

AU - Cai, Chen Bo

AU - Xie, Shunping

AU - Goodarzi, Mohammad

AU - Fu, Hai Yan

AU - She, Yuan Bin

N1 - Funding Information: Authors are grateful to the financial support from the National Natural Science Foundation of China (Grants Nos. 21665022, 21665002, 21776321, 21576297, 21706233, 21476270), Key Projects of Technological Innovation of Hubei Province (2016ACA138), the Open Research Program (Nos. 2015ZD001, 2015ZD002 and 2015ZY006) from the Modernization Engineering Technology Research Center of Ethnic Minority Medicine of Hubei province (South-Central University for Nationalities), and The Talented Youth Cultivation Program from “the Fundamental Research Funds for the Central Universities”, South-Central University for Nationalities (No. CRZ18002). Lu Xu is financially supported by Provincial Key Disciplines of Chemical Engineering and Technology in Guizhou Province (No. ZDXK[2017]8), Guizhou Engineering Research Center (QJHKYZ [2017]024), Guizhou Provincial Science and Technology Department (No. QKHJC[2017]1186), and the Talented Researcher Program from Guizhou Provincial Department of Education (QJHKYZ[2018]073). Publisher Copyright: © 2018 Elsevier B.V.

PY - 2018/12/15

Y1 - 2018/12/15

N2 - Cross-validation (CV) is widely used to estimate model complexity or the number of significant latent variables (LVs) for multivariate calibration methods like partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed in the experimental space as uniformly as possible. Motivated by this idea, we proposed a new CV method called representative splitting cross-validation (RSCV). In RSCV, firstly, the DUPLEX algorithm was used to sequentially divide the original training set into k (in this work, k = 2, 4, 8 and 16) equal parts. Secondly, a series of k-fold (k = 2, 4, 8 and 16) CVs were performed based on the above data splitting. Finally, the pooled root mean squared error of CV (RMSECV) was used to estimate model complexity. Five real multivariate calibration data sets were investigated and RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV). With a maximum k of 16, RSCV was shown to be a useful and stable method to select PLS LVs, and can obtain simpler models with acceptable computational burden.

KW - Cross-validation (CV)

KW - Model complexity

KW - Multivariate calibration

KW - Partial least squares (PLS)

KW - Representative splitting cross-validation (RSCV)

UR - http://www.scopus.com/inward/record.url?scp=85056180205&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85056180205&partnerID=8YFLogxK

U2 - 10.1016/j.chemolab.2018.10.008

DO - 10.1016/j.chemolab.2018.10.008

M3 - Article

AN - SCOPUS:85056180205

SN - 0169-7439

VL - 183

SP - 29

EP - 35

JO - Chemometrics and Intelligent Laboratory Systems

JF - Chemometrics and Intelligent Laboratory Systems

ER -
