DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine

Abdul Wahab; Hilal Tayara; Zhenyu Xuan; Kil To Chong

doi:10.1038/s41598-020-80430-x

DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine

Abdul Wahab, Hilal Tayara, Zhenyu Xuan, Kil To Chong

Research output: Contribution to journal › Article › peer-review

23 Scopus citations

Abstract

N4-methylcytosine is a biochemical alteration of DNA that affects the genetic operations without modifying the DNA nucleotides such as gene expression, genomic imprinting, chromosome stability, and the development of the cell. In the proposed work, a computational model, 4mCNLP-Deep, used the word embedding approach as a vector formulation by exploiting deep learning based CNN algorithm to predict 4mC and non-4mC sites on the C.elegans genome dataset. Diversity of ranges employed for the experimental such as corpus k-mer and k-fold cross-validation to obtain the prevailing capabilities. The 4mCNLP-Deep outperform from the state-of-the-art predictor by achieving the results in five evaluation metrics by following; Accuracy (ACC) as 0.9354, Mathew’s correlation coefficient (MCC) as 0.8608, Specificity (Sp) as 0.89.96, Sensitivity (Sn) as 0.9563, and Area under curve (AUC) as 0.9731 by using 3-mer corpus word2vec and 3-fold cross-validation and attained the increment of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. At last, we developed the online webserver http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, for the experimental researchers to get the results easily.

Original language	English (US)
Article number	212
Journal	Scientific reports
Volume	11
Issue number	1
DOIs	https://doi.org/10.1038/s41598-020-80430-x
State	Published - Dec 2021

ASJC Scopus subject areas

General

Access to Document

10.1038/s41598-020-80430-x

Cite this

@article{25261ed0c1b242349ba0cafee476842e,

title = "DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine",

abstract = "N4-methylcytosine is a biochemical alteration of DNA that affects the genetic operations without modifying the DNA nucleotides such as gene expression, genomic imprinting, chromosome stability, and the development of the cell. In the proposed work, a computational model, 4mCNLP-Deep, used the word embedding approach as a vector formulation by exploiting deep learning based CNN algorithm to predict 4mC and non-4mC sites on the C.elegans genome dataset. Diversity of ranges employed for the experimental such as corpus k-mer and k-fold cross-validation to obtain the prevailing capabilities. The 4mCNLP-Deep outperform from the state-of-the-art predictor by achieving the results in five evaluation metrics by following; Accuracy (ACC) as 0.9354, Mathew{\textquoteright}s correlation coefficient (MCC) as 0.8608, Specificity (Sp) as 0.89.96, Sensitivity (Sn) as 0.9563, and Area under curve (AUC) as 0.9731 by using 3-mer corpus word2vec and 3-fold cross-validation and attained the increment of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. At last, we developed the online webserver http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, for the experimental researchers to get the results easily.",

author = "Abdul Wahab and Hilal Tayara and Zhenyu Xuan and Chong, {Kil To}",

note = "Funding Information: This work was supported in part by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. 2020R1A2C2005612), in part by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044816), and in part by Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600470). Publisher Copyright: {\textcopyright} 2021, The Author(s).",

year = "2021",

month = dec,

doi = "10.1038/s41598-020-80430-x",

language = "English (US)",

volume = "11",

journal = "Scientific reports",

issn = "2045-2322",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine

AU - Wahab, Abdul

AU - Tayara, Hilal

AU - Xuan, Zhenyu

AU - Chong, Kil To

N1 - Funding Information: This work was supported in part by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. 2020R1A2C2005612), in part by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044816), and in part by Human Resources Program in Energy Technology” of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600470). Publisher Copyright: © 2021, The Author(s).

PY - 2021/12

Y1 - 2021/12

N2 - N4-methylcytosine is a biochemical alteration of DNA that affects the genetic operations without modifying the DNA nucleotides such as gene expression, genomic imprinting, chromosome stability, and the development of the cell. In the proposed work, a computational model, 4mCNLP-Deep, used the word embedding approach as a vector formulation by exploiting deep learning based CNN algorithm to predict 4mC and non-4mC sites on the C.elegans genome dataset. Diversity of ranges employed for the experimental such as corpus k-mer and k-fold cross-validation to obtain the prevailing capabilities. The 4mCNLP-Deep outperform from the state-of-the-art predictor by achieving the results in five evaluation metrics by following; Accuracy (ACC) as 0.9354, Mathew’s correlation coefficient (MCC) as 0.8608, Specificity (Sp) as 0.89.96, Sensitivity (Sn) as 0.9563, and Area under curve (AUC) as 0.9731 by using 3-mer corpus word2vec and 3-fold cross-validation and attained the increment of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. At last, we developed the online webserver http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, for the experimental researchers to get the results easily.

AB - N4-methylcytosine is a biochemical alteration of DNA that affects the genetic operations without modifying the DNA nucleotides such as gene expression, genomic imprinting, chromosome stability, and the development of the cell. In the proposed work, a computational model, 4mCNLP-Deep, used the word embedding approach as a vector formulation by exploiting deep learning based CNN algorithm to predict 4mC and non-4mC sites on the C.elegans genome dataset. Diversity of ranges employed for the experimental such as corpus k-mer and k-fold cross-validation to obtain the prevailing capabilities. The 4mCNLP-Deep outperform from the state-of-the-art predictor by achieving the results in five evaluation metrics by following; Accuracy (ACC) as 0.9354, Mathew’s correlation coefficient (MCC) as 0.8608, Specificity (Sp) as 0.89.96, Sensitivity (Sn) as 0.9563, and Area under curve (AUC) as 0.9731 by using 3-mer corpus word2vec and 3-fold cross-validation and attained the increment of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. At last, we developed the online webserver http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, for the experimental researchers to get the results easily.

UR - http://www.scopus.com/inward/record.url?scp=85098959879&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85098959879&partnerID=8YFLogxK

U2 - 10.1038/s41598-020-80430-x

DO - 10.1038/s41598-020-80430-x

M3 - Article

C2 - 33420191

AN - SCOPUS:85098959879

SN - 2045-2322

VL - 11

JO - Scientific reports

JF - Scientific reports

IS - 1

M1 - 212

ER -

DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine

Abstract

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this