Representative splitting cross validation

Lu Xu, Ou Hu, Yuwan Guo, Mengqin Zhang, Daowang Lu, Chen Bo Cai, Shunping Xie, Mohammad Goodarzi, Hai Yan Fu, Yuan Bin She

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Cross-validation (CV) is widely used to estimate model complexity, i.e., the number of significant latent variables (LVs), for multivariate calibration methods such as partial least squares (PLS). A basic consideration when developing and validating multivariate calibration models is that both the training and validation sets should be representative and distributed as uniformly as possible in the experimental space. Motivated by this idea, we propose a new CV method called representative splitting cross-validation (RSCV). In RSCV, the DUPLEX algorithm is first used to sequentially divide the original training set into k equal parts (in this work, k = 2, 4, 8 and 16). Second, a series of k-fold CVs (k = 2, 4, 8 and 16) is performed on these splits. Finally, the pooled root mean squared error of CV (RMSECV) is used to estimate model complexity. RSCV was compared with leave-one-out CV (LOOCV), 10-fold CV and Monte Carlo CV (MCCV) on five real multivariate calibration data sets. With a maximum k of 16, RSCV proved a useful and stable method for selecting PLS LVs and obtained simpler models with an acceptable computational burden.
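The three-step procedure described above (DUPLEX splitting into k representative parts, k-fold CV on those parts, pooled RMSECV) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are made up, and principal component regression stands in for PLS so that the sketch needs only NumPy.

```python
import numpy as np

def duplex_split(idx, X):
    """Split sample indices into two representative halves (DUPLEX, Snee 1977).

    The two mutually most distant points seed set A, the next most distant
    pair seeds set B; each remaining point is then assigned alternately to
    the set it is farthest from (max of min distance)."""
    idx = list(idx)
    D = np.linalg.norm(X[idx][:, None, :] - X[idx][None, :, :], axis=-1)
    remaining = set(range(len(idx)))
    sets = ([], [])
    for s in sets:  # seed each set with the farthest remaining pair
        rem = sorted(remaining)
        sub = D[np.ix_(rem, rem)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        s.extend([rem[i], rem[j]])
        remaining -= {rem[i], rem[j]}
    turn = 0
    while remaining:
        rem = sorted(remaining)
        dmin = D[np.ix_(rem, sets[turn])].min(axis=1)  # min distance to set
        pick = rem[int(np.argmax(dmin))]               # farthest such point
        sets[turn].append(pick)
        remaining.remove(pick)
        turn ^= 1
    return [idx[i] for i in sets[0]], [idx[i] for i in sets[1]]

def rscv_folds(X, k):
    """Sequentially apply DUPLEX to obtain k representative folds (k = 2, 4, 8, ...)."""
    folds = [list(range(len(X)))]
    while len(folds) < k:
        folds = [half for f in folds for half in duplex_split(f, X)]
    return folds

def pooled_rmsecv(X, y, folds, n_comp):
    """Pooled RMSECV at a given model complexity, illustrated with principal
    component regression (PCR) as a stand-in for PLS."""
    press = 0.0
    for f in folds:
        train = [i for g in folds if g is not f for i in g]
        Xt, yt = X[train], y[train]
        mX, my = Xt.mean(axis=0), yt.mean()
        U, s, Vt = np.linalg.svd(Xt - mX, full_matrices=False)
        b = Vt[:n_comp].T @ np.diag(1.0 / s[:n_comp]) @ U[:, :n_comp].T @ (yt - my)
        pred = (X[f] - mX) @ b + my
        press += np.sum((pred - y[f]) ** 2)   # pool squared errors over folds
    return np.sqrt(press / len(X))
```

In this sketch the model complexity would be chosen by evaluating `pooled_rmsecv` over a range of `n_comp` values and taking the minimum, mirroring how the paper selects the number of PLS LVs.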

Original language: English (US)
Pages (from-to): 29-35
Number of pages: 7
Journal: Chemometrics and Intelligent Laboratory Systems
Volume: 183
DOI: 10.1016/j.chemolab.2018.10.008
State: Published - Dec 15 2018

Keywords

  • Cross-validation (CV)
  • Model complexity
  • Multivariate calibration
  • Partial least squares (PLS)
  • Representative splitting cross-validation (RSCV)

ASJC Scopus subject areas

  • Analytical Chemistry
  • Software
  • Process Chemistry and Technology
  • Spectroscopy
  • Computer Science Applications

Cite this

Xu, L., Hu, O., Guo, Y., Zhang, M., Lu, D., Cai, C. B., ... She, Y. B. (2018). Representative splitting cross validation. Chemometrics and Intelligent Laboratory Systems, 183, 29-35. https://doi.org/10.1016/j.chemolab.2018.10.008
