Drug-likeness analysis of traditional Chinese medicines: Prediction of drug-likeness using machine learning approaches

Sheng Tian, Junmei Wang, Youyong Li, Xiaojie Xu, Tingjun Hou

Research output: Contribution to journal > Article

50 Citations (Scopus)

Abstract

Quantitative or qualitative characterization of the drug-like features of known drugs may help medicinal and computational chemists select higher-quality drug leads from a huge pool of compounds and improve the efficiency of drug design pipelines. For this purpose, theoretical models that discriminate between drug-like and non-drug-like molecules based on molecular physicochemical properties and structural fingerprints were developed using naive Bayesian classification (NBC) and recursive partitioning (RP) techniques, and the drug-likeness of the compounds in the Traditional Chinese Medicine Compound Database (TCMCD) was then evaluated. First, the impact of molecular physicochemical properties and structural fingerprints on the prediction accuracy of drug-likeness was examined. We found that, compared with simple molecular properties, structural fingerprints were more essential for the accurate prediction of drug-likeness. Then, a variety of Bayesian classifiers were constructed by changing the ratio of drug-like to non-drug-like molecules and the size of the training set. The results indicate that the prediction accuracy of the Bayesian classifiers was closely related to the size and the degree of balance of the training set. When a balanced training set was used, the best Bayesian classifier, based on 21 physicochemical properties and the LCFP-6 fingerprint set, yielded an overall leave-one-out (LOO) cross-validated accuracy of 91.4% for the 140,000 molecules in the training set and 90.9% for the 40,000 molecules in the test set. In addition, RP classifiers with different maximum depths were constructed and compared with the Bayesian classifiers, and we found that the best Bayesian classifier outperformed the best RP model with respect to overall prediction accuracy. Moreover, the Bayesian classifier employing structural fingerprints highlights the important substructures favorable or unfavorable for drug-likeness, offering valuable extra information for obtaining high-quality lead compounds in the early stage of the drug design/discovery process. Finally, the best Bayesian classifier was used to predict the drug-likeness of 33,961 compounds in TCMCD. Our calculations show that 59.37% of the molecules in TCMCD were identified as drug-like, indicating that traditional Chinese medicines (TCMs) are an excellent source of drug-like molecules. Furthermore, the important structural fingerprints in TCMCD were detected and analyzed. Considering that the pharmacology of TCMCD and MDDR (MDL Drug Data Report) was linked by important common structural features, the potential pharmacology of the compounds in TCMCD may be annotated by these structural signatures identified from Bayesian analysis, which may help promote the development of TCMs.
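
The classification idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it is a generic Bernoulli naive Bayes classifier with add-one (Laplace) smoothing, written in plain Python, and it assumes each molecule has already been reduced to a set of fingerprint features. The feature names, the eight-molecule balanced toy data set, and the function names (train, log_odds, predict, loo_accuracy) are all hypothetical and exist only for illustration.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

# A sample is (set of fingerprint features present, label); label 1 = drug-like.
Sample = Tuple[FrozenSet[str], int]


def train(samples: List[Sample]) -> Dict:
    """Count how often each feature occurs in each class."""
    vocab = sorted({f for feats, _ in samples for f in feats})
    n_class = {0: 0, 1: 0}
    n_feat = {0: {f: 0 for f in vocab}, 1: {f: 0 for f in vocab}}
    for feats, label in samples:
        n_class[label] += 1
        for f in feats:
            n_feat[label][f] += 1
    return {"vocab": vocab, "n_class": n_class, "n_feat": n_feat, "n": len(samples)}


def log_odds(model: Dict, feats: FrozenSet[str]) -> float:
    """Log-odds of drug-like versus non-drug-like under the naive Bayes assumption."""
    score = {}
    for c in (0, 1):
        # Add-one smoothed class prior.
        s = math.log((model["n_class"][c] + 1) / (model["n"] + 2))
        for f in model["vocab"]:
            # Add-one smoothed probability that feature f is present in class c.
            p = (model["n_feat"][c][f] + 1) / (model["n_class"][c] + 2)
            s += math.log(p if f in feats else 1.0 - p)
        score[c] = s
    return score[1] - score[0]


def predict(model: Dict, feats: FrozenSet[str]) -> int:
    """Return 1 (drug-like) when the log-odds are positive, otherwise 0."""
    return 1 if log_odds(model, feats) > 0 else 0


def loo_accuracy(samples: List[Sample]) -> float:
    """Leave-one-out accuracy: retrain without each sample, then predict it."""
    hits = 0
    for i, (feats, label) in enumerate(samples):
        model = train(samples[:i] + samples[i + 1:])
        if predict(model, feats) == label:
            hits += 1
    return hits / len(samples)


# Hypothetical, balanced toy data: four drug-like and four non-drug-like samples.
DATA: List[Sample] = [
    (frozenset({"aromatic_ring", "amide"}), 1),
    (frozenset({"aromatic_ring", "amine"}), 1),
    (frozenset({"amide", "amine"}), 1),
    (frozenset({"aromatic_ring", "amide", "amine"}), 1),
    (frozenset({"nitro", "alkyl_chain"}), 0),
    (frozenset({"alkyl_chain"}), 0),
    (frozenset({"nitro"}), 0),
    (frozenset({"nitro", "alkyl_chain", "aromatic_ring"}), 0),
]

MODEL = train(DATA)
```

Calling predict(MODEL, frozenset({"aromatic_ring", "amide"})) classifies an unseen feature set, and loo_accuracy(DATA) gives the leave-one-out estimate on the toy data. Because every feature contributes an additive term to the log-odds, features with large positive or negative contributions can be read off as favorable or unfavorable for the drug-like class, which mirrors the substructure interpretation described in the abstract.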

Original language: English (US)
Pages (from-to): 2875-2886
Number of pages: 12
Journal: Molecular Pharmaceutics
Volume: 9
Issue number: 10
DOI: 10.1021/mp300198d
State: Published - 2012

Fingerprint

Chinese Traditional Medicine
Dermatoglyphics
Pharmaceutical Preparations
Databases
Bayes Theorem
Drug Design
Machine Learning
Pharmaceutical Databases
Pharmacology
Drug Discovery
Theoretical Models

ASJC Scopus subject areas

  • Pharmaceutical Science
  • Molecular Medicine
  • Drug Discovery

Cite this

Drug-likeness analysis of traditional Chinese medicines: Prediction of drug-likeness using machine learning approaches. / Tian, Sheng; Wang, Junmei; Li, Youyong; Xu, Xiaojie; Hou, Tingjun.

In: Molecular Pharmaceutics, Vol. 9, No. 10, 2012, p. 2875-2886.

@article{08245cc687214fd3a6b4dfd6752810e8,
title = "Drug-likeness analysis of traditional {C}hinese medicines: Prediction of drug-likeness using machine learning approaches",
author = "Sheng Tian and Junmei Wang and Youyong Li and Xiaojie Xu and Tingjun Hou",
year = "2012",
doi = "10.1021/mp300198d",
language = "English (US)",
volume = "9",
pages = "2875--2886",
journal = "Molecular Pharmaceutics",
issn = "1543-8384",
publisher = "American Chemical Society",
number = "10",

}

TY - JOUR

T1 - Drug-likeness analysis of traditional Chinese medicines

T2 - Prediction of drug-likeness using machine learning approaches

AU - Tian, Sheng

AU - Wang, Junmei

AU - Li, Youyong

AU - Xu, Xiaojie

AU - Hou, Tingjun

PY - 2012

Y1 - 2012

UR - http://www.scopus.com/inward/record.url?scp=84870177359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84870177359&partnerID=8YFLogxK

U2 - 10.1021/mp300198d

DO - 10.1021/mp300198d

M3 - Article

C2 - 22738405

AN - SCOPUS:84870177359

VL - 9

SP - 2875

EP - 2886

JO - Molecular Pharmaceutics

JF - Molecular Pharmaceutics

SN - 1543-8384

IS - 10

ER -